Posted to dev@uima.apache.org by Marshall Schor <ms...@schor.com> on 2011/10/01 00:36:33 UTC
Re: CAS Id
slight generalization...
For texts, the CAS == unit of work flowing in UIMA == (typically) a "document"
But, UIMA is used for other kinds of unstructured data, such as audio, video,
image, etc. In this case the CAS == unit of work flowing in UIMA != a "document"...
We might want to consider more generic naming, because of this, like Jörn's
"CasId". So in the following a name like CAS.setId() or CAS.setIdUri() might be
better (dropping "Document").
-Marshall
On 9/30/2011 10:59 AM, Richard Eckart de Castilho wrote:
> I always thought that a CAS.setDocumentUri() would have been helpful. In the beginning I mistook setSofaDataUri() to be such a thing and was quite surprised that if I set that, I cannot set the document text anymore.
>
> So how about adding a setDocumentUri() method to CAS?
>
> From the experience with our own type system which supports such things, we find that it is also very useful to have a documentBaseUri for cases where recursive processing is taking place. I find a simple ID is not enough in many cases, e.g. when recursively reading files from one directory and writing them to another one while preserving the relative hierarchy.
>
> So a setDocumentBaseUri() in my opinion would also be desirable.
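
The directory-mirroring case described above can be sketched with plain java.net.URI, independent of UIMA. This is only an illustration under assumptions: the class name BaseUriSketch, the method mapToTarget, and the example paths are hypothetical and are not part of any UIMA API. The idea is that a document URI plus a base URI is enough to recover the relative path, which can then be re-rooted under a target directory.

```java
import java.net.URI;

// Hypothetical sketch: shows why a base URI next to the document URI is useful.
class BaseUriSketch {

    /**
     * Maps a document located under sourceBase to the same relative
     * location under targetBase.
     */
    static URI mapToTarget(URI sourceBase, URI document, URI targetBase) {
        // relativize() returns the document URI unchanged when it is not
        // located under sourceBase, so an absolute result signals a mismatch.
        URI relative = sourceBase.relativize(document);
        if (relative.isAbsolute()) {
            throw new IllegalArgumentException(
                document + " is not located under " + sourceBase);
        }
        return targetBase.resolve(relative);
    }

    public static void main(String[] args) {
        URI sourceBase = URI.create("file:/data/in/");          // assumed example path
        URI document = URI.create("file:/data/in/news/2011/a.txt");
        URI targetBase = URI.create("file:/data/out/");         // assumed example path
        System.out.println(mapToTarget(sourceBase, document, targetBase));
    }
}
```

With only an opaque ID, the writer at the end of the pipeline would have no way to reconstruct the news/2011/ part of the path.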
>
> Cheers,
>
> -- Richard
>
> On 30.09.2011 at 16:53, Jörn Kottmann wrote:
>
>> On 9/30/11 4:38 PM, Marshall Schor wrote:
>>> Can you say a bit more what this is?
>>>
>> Sure. The intent of the ID field is to reference a CAS instance to
>> another system.
>>
>> Let's say we have an application where a UIMA analysis pipeline is used
>> to process documents
>> which are stored in a database. There you need to write the IDs of the
>> documents into the CAS,
>> otherwise it is not possible to write analysis results back to the database.
>>
>> So typically your collection reader or first AE in the pipeline will set
>> the ID and the last AE in the
>> pipeline will use it again to save the analysis results.
>>
>> Currently you always need to define a FS which holds your custom ID, but
>> I guess a generic
>> string ID field would be just fine for almost any use case.
>>
>> Jörn
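
The pattern Jörn describes (the reader sets the ID, the last component uses it to write results back) can be sketched without UIMA on the classpath. Everything here is an assumption made for illustration: WorkUnit is a hypothetical stand-in for a CAS with a generic string ID field, the maps stand in for the database, and none of the names are UIMA API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of carrying an external ID through a pipeline.
class CasIdSketch {

    /** Stand-in for a CAS with a generic string ID; not the UIMA CAS interface. */
    static final class WorkUnit {
        String id;      // ID of the record in the external system
        String text;    // data to analyze
        String result;  // analysis output
    }

    /** Plays the collection reader: loads one record and stamps its ID on the unit. */
    static WorkUnit read(Map<String, String> source, String id) {
        WorkUnit unit = new WorkUnit();
        unit.id = id;
        unit.text = source.get(id);
        return unit;
    }

    /** Placeholder analysis step: counts whitespace-separated tokens. */
    static void analyze(WorkUnit unit) {
        unit.result = "tokens=" + unit.text.trim().split("\\s+").length;
    }

    /** Plays the last component: uses the ID to store the result for the right record. */
    static void write(WorkUnit unit, Map<String, String> results) {
        results.put(unit.id, unit.result);
    }

    /** Runs reader, analysis, and writer over every record in the source. */
    static Map<String, String> run(Map<String, String> source) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String id : source.keySet()) {
            WorkUnit unit = read(source, id);
            analyze(unit);
            write(unit, results);
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, String> source = new LinkedHashMap<>();
        source.put("doc-17", "CAS ids are useful");
        System.out.println(run(source));
    }
}
```

The writer never needs to know how the reader obtained the record; the ID carried on the work unit is the only link back to the external system.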
Re: CAS Id
Posted by Richard Eckart de Castilho <ec...@tk.informatik.tu-darmstadt.de>.
On 01.10.2011 at 00:36, Marshall Schor wrote:
> slight generalization...
>
> For texts, the CAS == unit of work flowing in UIMA == (typically) a "document"
>
> But, UIMA is used for other kinds of unstructured data, such as audio, video,
> image, etc. In this case the CAS == unit of work flowing in UIMA != a "document"...
I am not a native speaker of English, but I suppose one could argue that there are "video documents" and "audio documents".
> We might want to consider more generic naming, because of this, like Jörn's
> "CasId". So in the following a name like CAS.setId() or CAS.setIdUri() might be
> better (dropping "Document").
>
> -Marshall
I would like to know if the ID should be per Sofa (per view) or really per (Base?)CAS, i.e. identifying the actual overall underlying data container.
--
-------------------------------------------------------------------
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab
FB 20 Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
eckartde@tk.informatik.tu-darmstadt.de
www.ukp.tu-darmstadt.de
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-------------------------------------------------------------------