Posted to dev@uima.apache.org by Neal R Lewis <nr...@us.ibm.com> on 2013/02/01 00:33:32 UTC

Summary of CAS Store Functionality

Hello All, 

Thank you again for all of your responses about the UIMA CAS Store. I'm glad that you were interested in this topic, and I would like to submit another summary to see if we can concisely define the requirements for interfacing with a CAS Store. 

We talked a bit about implementation (binary vs. XMI, DB vs. file system), but I would like to first discuss an interface for a CAS Store. The reason is that, while the functionality expected of a CAS Store seems consistent, there may be different implementation constraints / preferences. I'll try to be concise, and if you would like to comment, please do so. 

Implementation: 
  - Compatible with current UIMA implementations (UIMAj, UIMACpp, UIMAFit) 
  - Well defined API
  - Documentation 

Functionality: 
  - Accessible from a  Web Service (SOAP / REST)
  - Maintain Collection of CASes
  - INSERT / DELETE / UPDATE / READ CASes
  - INSERT / DELETE / UPDATE / READ CAS fragments (objects within a CAS)
  - READ FSes produced by a certain annotator across all CASes in all collections or in a certain collection
  - Query CASes that already have annotations
  - Use stable identification of CAS
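
The CRUD part of the list above could be sketched as a small Java interface. This is only an illustration under stated assumptions: the names CasStore and InMemoryCasStore, the method signatures, and the choice of an opaque byte[] payload are all hypothetical and are not part of any UIMA release. The in-memory class exists only to make the contract concrete.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical interface: none of these names exist in UIMA.
// The CAS payload is an opaque byte[] so the sketch stays neutral on binary vs. XMI.
interface CasStore {
    void insert(String collection, String casId, byte[] serializedCas);
    Optional<byte[]> read(String collection, String casId);
    boolean update(String collection, String casId, byte[] serializedCas);
    boolean delete(String collection, String casId);
    List<String> listCasIds(String collection);
}

// Toy in-memory implementation, only to show the contract.
final class InMemoryCasStore implements CasStore {
    private final Map<String, Map<String, byte[]>> collections = new LinkedHashMap<>();

    @Override
    public void insert(String collection, String casId, byte[] serializedCas) {
        Map<String, byte[]> cases =
                collections.computeIfAbsent(collection, k -> new LinkedHashMap<>());
        // Stable identification: an id may be stored only once per collection.
        if (cases.putIfAbsent(casId, serializedCas.clone()) != null) {
            throw new IllegalStateException("CAS already stored: " + casId);
        }
    }

    @Override
    public Optional<byte[]> read(String collection, String casId) {
        Map<String, byte[]> cases = collections.get(collection);
        if (cases == null || !cases.containsKey(casId)) {
            return Optional.empty();
        }
        return Optional.of(cases.get(casId).clone());
    }

    @Override
    public boolean update(String collection, String casId, byte[] serializedCas) {
        Map<String, byte[]> cases = collections.get(collection);
        if (cases == null || !cases.containsKey(casId)) {
            return false; // nothing to update
        }
        cases.put(casId, serializedCas.clone());
        return true;
    }

    @Override
    public boolean delete(String collection, String casId) {
        Map<String, byte[]> cases = collections.get(collection);
        return cases != null && cases.remove(casId) != null;
    }

    @Override
    public List<String> listCasIds(String collection) {
        Map<String, byte[]> cases = collections.get(collection);
        return cases == null ? new ArrayList<>() : new ArrayList<>(cases.keySet());
    }
}
```

A web service (SOAP / REST) front end, fragment-level operations, and annotator-based queries would sit on top of an interface like this; they are left out to keep the sketch short.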

As for the identification of CASes and objects within, I would like to push the idea of a Feature Structure ID, as I've written about before.  Were there any other thoughts / suggestions about such an object?  



Re: Summary of CAS Store Functionality

Posted by Marshall Schor <ms...@schor.com>.
Hi,

At the risk of maybe discussing things that have been previously discussed :-) ,
here are some thoughts.  I'm thinking (mainly) from the perspective of UIMA
processing the extracts of a CAS Store.  One could, of course, also imagine
non-UIMA kinds of processing of extracts of a CAS Store - e.g., count the number
of annotations of a certain kind in the store.

=========
Re:  Globally Unique Ids (GUID) for CASes and FeatureStructures (FSs).
Since FSs are associated with a particular CAS, maybe it is useful to think of
the GUIDs as 2 parts: 1) a GUID for the CAS itself, plus 2) some scheme to
number each Feature Structure in the CAS.
In this approach, the FS part of the GUID could in the majority of cases be a
1-word int (with some 'escape' for the rare case where the number of FSs that
could exist in a CAS (over time) exceeded the limit imposed by 1 word).
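
A minimal sketch of such a two-part id, assuming a java.util.UUID for the CAS part and a 32-bit word for the FS part. The class name FsId, the byte layout, and the use of -1 as the escape marker are assumptions made for illustration; nothing here is an existing UIMA format.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical two-part identifier: a GUID for the CAS plus a per-CAS FS number.
final class FsId {
    // Assumed escape marker: FS numbers are non-negative, so -1 is free to mean
    // "the real number follows as a 64-bit value".
    private static final int ESCAPE = -1;

    final UUID casId;     // identifies the CAS
    final long fsNumber;  // identifies the FS within that CAS

    FsId(UUID casId, long fsNumber) {
        if (fsNumber < 0) {
            throw new IllegalArgumentException("fsNumber must be non-negative");
        }
        this.casId = casId;
        this.fsNumber = fsNumber;
    }

    // Writes 16 bytes for the CAS GUID, then one 4-byte word in the common case,
    // or the escape word plus 8 bytes in the rare overflow case.
    void writeTo(ByteBuffer out) {
        out.putLong(casId.getMostSignificantBits());
        out.putLong(casId.getLeastSignificantBits());
        if (fsNumber <= Integer.MAX_VALUE) {
            out.putInt((int) fsNumber);
        } else {
            out.putInt(ESCAPE);
            out.putLong(fsNumber);
        }
    }

    // Reads back what writeTo produced.
    static FsId readFrom(ByteBuffer in) {
        long most = in.getLong();
        long least = in.getLong();
        int word = in.getInt();
        long fsNumber = (word == ESCAPE) ? in.getLong() : word;
        return new FsId(new UUID(most, least), fsNumber);
    }
}
```

With this layout an id costs 20 bytes in the common case and 28 bytes when the escape is needed.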

=========
Re: Loading parts of a particular CAS (e.g., a "projection" via some kind of
query, such as all Feature Structures of types X, Y, or Z). 
  - Feature Structures can have references to other FSs
  - Feature Structures can be associated with a SofA - for instance, an
annotation over text, using its begin / end values to get the "covered" text.

When thinking about "loading" some part of a CAS via a projection, one has to
consider whether or not to load the SofA associated with it, and whether or not
to load referenced FSs (recursively, perhaps, as well).  If the referenced FSs
were *not* loaded, we could imagine replacing the references with a special
value which indicated it was (a) not loaded, and (b) had the FS id part - to
enable "lazy" loading (if dereferenced).
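
The "special value" idea can be sketched as a small placeholder object that carries the FS id and loads its target only when dereferenced. The class LazyFsRef and the loader callback are hypothetical; in a real CAS the reference would live inside the FS heap rather than in a separate Java object.

```java
import java.util.function.LongFunction;

// Hypothetical placeholder for an FS reference that was not loaded with the projection.
final class LazyFsRef<T> {
    private final long fsId;               // FS id part, kept so the target can be fetched later
    private final LongFunction<T> loader;  // assumed callback into the CAS Store
    private T target;                      // stays null until the reference is dereferenced
    private boolean loaded;                // (a) records whether the target has been loaded

    LazyFsRef(long fsId, LongFunction<T> loader) {
        this.fsId = fsId;
        this.loader = loader;
    }

    boolean isLoaded() {
        return loaded;
    }

    long fsId() {
        return fsId;
    }

    // Loads the target on first use and caches it for later calls.
    T get() {
        if (!loaded) {
            target = loader.apply(fsId);
            loaded = true;
        }
        return target;
    }
}
```

Whether dereferencing should also pull in the FSs that the target itself refers to (the recursive case) is a policy decision this sketch leaves open.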

=========
Re: Loading parts of a CAS - indexing FSs (or not).  When a FS is loaded, a
decision has to be made - should it be "added to the indexes" or not?  Adding to
indexes can be an expensive operation (depending on the indexes, etc.).  If the
particular FS is one that is only located by dereferencing a FS reference, then
it won't need to be in the indexes (an efficiency optimization). 

As an example, consider the built-in Feature Structure supporting lists:
uima.cas.FSList and uima.cas.EmptyFSList.  These are unlikely to be indexed, and
when loaded, they probably should not be indexed (for efficiency). 

The existing UIMA serialization code records which FSs should be indexed upon
loading, and which shouldn't.  This information is kept *per view* - that is, a
FS could be indexed in one view, and not in another view. This information
should probably be kept with the FS in a Cas Store, so later loading could do
the right thing.
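
One way to picture keeping this information with the FS is sketched below. StoredFs and ViewIndexLoader are hypothetical names, the view names are just illustrative strings, and the code does not reflect how the existing UIMA serializers store this information; it only shows the per-view decision being replayed at load time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stored form of an FS: the views in which it was indexed travel with it.
final class StoredFs {
    final long fsId;
    final String typeName;
    final Set<String> indexedInViews; // empty: reachable only by dereferencing a reference

    StoredFs(long fsId, String typeName, Set<String> indexedInViews) {
        this.fsId = fsId;
        this.typeName = typeName;
        this.indexedInViews = Set.copyOf(indexedInViews);
    }
}

final class ViewIndexLoader {
    // Returns, per view name, the ids of the FSs to add to that view's indexes.
    // FSs with no recorded view are loaded but skipped here, which avoids the indexing cost.
    static Map<String, List<Long>> planIndexing(List<StoredFs> loaded) {
        Map<String, List<Long>> perView = new TreeMap<>();
        for (StoredFs fs : loaded) {
            for (String view : fs.indexedInViews) {
                perView.computeIfAbsent(view, v -> new ArrayList<>()).add(fs.fsId);
            }
        }
        return perView;
    }
}
```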

=========
Re: FS reference to another FS in a different CAS - This is not currently
supported, and there may be lots of issues to think through to do this in a
general manner, with the right efficiency tradeoffs.

=========
Re: reading collections of FSs from collections of CASes. This would happen, I
think, in the use-case described below as "READ FSes produced by a certain
annotator across all CASes in all collections or in a certain collection".

There are maybe two sub use-cases. 
  - (u1) One is where the READ is being done by some application outside of UIMA.
  - (u2) The other is where the intent is to run a UIMA pipeline over this
collection.  This has 2 sub cases:
    -- (u2a) One where each set of FSs associated with one particular CAS is
processed as a (partial) load of that CAS, and multiple of these (partially)
loaded CASes are processed.
    -- (u2b) One where all of the FSs associated with all of the CASes are
loaded together into one new CAS (having of course a new CAS Id).

If the FS is "isolated", meaning that it has no reference to a SofA and no
references to other FSs, then a new CAS could be constructed with these FSs
loaded.  The "unique" FSid (consisting of the GUID for the CAS + the FSid) would
change, because the CAS they were loaded into would have a new GUID.
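
The difference between (u2a) and (u2b) for isolated FSs can be sketched as follows. FsHit and CrossCasRead are hypothetical names, and a plain String stands in for the CAS GUID; the point is only that grouping preserves the original ids while merging replaces them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical query result row: which CAS an FS came from, and its number there.
final class FsHit {
    final String casId;
    final long fsNumber;

    FsHit(String casId, long fsNumber) {
        this.casId = casId;
        this.fsNumber = fsNumber;
    }
}

final class CrossCasRead {
    // (u2a): one partial load per source CAS; the original ids are preserved.
    static Map<String, List<Long>> groupByCas(List<FsHit> hits) {
        Map<String, List<Long>> byCas = new LinkedHashMap<>();
        for (FsHit hit : hits) {
            byCas.computeIfAbsent(hit.casId, k -> new ArrayList<>()).add(hit.fsNumber);
        }
        return byCas;
    }

    // (u2b), isolated FSs only: everything goes into one new CAS, so every FS
    // receives a new id and the link to its source CAS is lost.
    static List<FsHit> mergeIntoNewCas(List<FsHit> hits, String newCasId) {
        List<FsHit> merged = new ArrayList<>();
        long next = 0;
        for (int i = 0; i < hits.size(); i++) {
            merged.add(new FsHit(newCasId, next++));
        }
        return merged;
    }
}
```

If the original ids had to survive a (u2b) merge, the store would need to record an old-id to new-id mapping somewhere; this sketch deliberately does not.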

But if the FS is not "isolated" and the use case envisions accessing that
FS's "covered text", for example, this would only fit into a CAS structure if
each loaded FS referring to a different SofA went into a separate view, and
SofAs for each view were added.

Likewise, if the FS is not "isolated" and some number of its FS refs needed to
be dereferenced, those references would be to FSs in different CASes.  If
these were loaded into one CAS, either (a) they would lose their CAS-association
identity, or (b) we would have an FS reference to another FS in a different CAS.

So, perhaps the underlying assumption for this use-case is either (u1) or (u2a)
- avoiding (u2b) and its issues. Is that what is envisioned?

-Marshall

On 1/31/2013 6:33 PM, Neal R Lewis wrote:
> Hello All, 
>
> Thank you again for all of your responses about the UIMA CAS Store. I'm glad that you were interested in this topic, and I would like to submit another summary to see if we can concisely define the requirements for interfacing with a CAS Store. 
>
> We talked a bit about implementation (binary vs. XMI, DB vs. file system), but I would like to first discuss an interface for a CAS Store. The reason is that, while the functionality expected of a CAS Store seems consistent, there may be different implementation constraints / preferences. I'll try to be concise, and if you would like to comment, please do so. 
>
> Implementation: 
>   - Compatible with current UIMA implementations (UIMAj, UIMACpp, UIMAFit) 
>   - Well defined API
>   - Documentation 
>
> Functionality: 
>   - Accessible from a  Web Service (SOAP / REST)
>   - Maintain Collection of CASes
>   - INSERT / DELETE/ UPDATE / READ CASes
>   - INSERT / DELETE/ UPDATE / READ Cas Fragments   (Objects within a CAS)
>   - READ FSes produced by a certain annotator across all CASes in all collections or in a certain collection
>   - Query CASes that already have annotations
>   - Use stable identification of CAS
>
> As for the identification of CASes and objects within, I would like to push the idea of a Feature Structure ID, as I've written about before.  Were there any other thoughts / suggestions about such an object?  
>
>
>


Re: Summary of CAS Store Functionality

Posted by Renaud Richardet <re...@epfl.ch>.
+1 for ID for feature structures as well.

On Mon, Feb 4, 2013 at 3:59 PM, Neal Lewis <im...@gmail.com> wrote:
> Yes.  It currently isn't in line with the OASIS spec, but this is something
> that we're preparing to suggest as an extension. I'm trying to get general
> feedback from the community right now as to how this could work and what it
> could look like. We have some documentation that we'll be sending out for it.
>
>
>
> On 02/04/2013 02:21 AM, Tommaso Teofili wrote:
>>
>> 2013/2/3 Richard Eckart de Castilho
>> <ec...@ukp.informatik.tu-darmstadt.de>
>>
>>> On 01.02.2013 at 00:33, Neal R Lewis <nr...@us.ibm.com> wrote:
>>>
>>>> As for the identification of CASes and objects within, I would like to
>>>> push the idea of a Feature Structure ID, as I've written about before.
>>>> Were there any other thoughts / suggestions about such an object?
>>>
>>> I think an ID for feature structures would be good for the CAS in
>>> general.
>>> Do you suggest to extend the UIMA FeatureStructure to contain an ID?
>>>
>> If I recall correctly, that's been discussed a number of times. I'd
>> personally be in favor of such a thing, but I wonder how it would relate
>> to the OASIS UIMA spec. Would it break the spec?
>>
>> Tommaso



-- 
Renaud Richardet
Blue Brain Project  PhD candidate
EPFL  Station 15
CH-1015 Lausanne
phone: +41-78-675-9501
http://people.epfl.ch/renaud.richardet

Re: Summary of CAS Store Functionality

Posted by Neal Lewis <im...@gmail.com>.
Yes.  It currently isn't in line with the OASIS spec, but this is something 
that we're preparing to suggest as an extension. I'm trying to get 
general feedback from the community right now as to how this could work 
and what it could look like. We have some documentation that we'll be 
sending out for it.


On 02/04/2013 02:21 AM, Tommaso Teofili wrote:
> 2013/2/3 Richard Eckart de Castilho <ec...@ukp.informatik.tu-darmstadt.de>
>
>> On 01.02.2013 at 00:33, Neal R Lewis <nr...@us.ibm.com> wrote:
>>
>>> As for the identification of CASes and objects within, I would like to
>>> push the idea of a Feature Structure ID, as I've written about before.
>>> Were there any other thoughts / suggestions about such an object?
>>
>> I think an ID for feature structures would be good for the CAS in general.
>> Do you suggest to extend the UIMA FeatureStructure to contain an ID?
>>
> If I recall correctly, that's been discussed a number of times. I'd
> personally be in favor of such a thing, but I wonder how it would relate
> to the OASIS UIMA spec. Would it break the spec?
>
> Tommaso


Re: Summary of CAS Store Functionality

Posted by Tommaso Teofili <to...@gmail.com>.
2013/2/3 Richard Eckart de Castilho <ec...@ukp.informatik.tu-darmstadt.de>

> On 01.02.2013 at 00:33, Neal R Lewis <nr...@us.ibm.com> wrote:
>
> > As for the identification of CASes and objects within, I would like to
> > push the idea of a Feature Structure ID, as I've written about before.
> > Were there any other thoughts / suggestions about such an object?
>
> I think an ID for feature structures would be good for the CAS in general.
> Do you suggest to extend the UIMA FeatureStructure to contain an ID?
>

If I recall correctly, that's been discussed a number of times. I'd
personally be in favor of such a thing, but I wonder how it would relate
to the OASIS UIMA spec. Would it break the spec?

Tommaso





Re: Summary of CAS Store Functionality

Posted by Richard Eckart de Castilho <ec...@ukp.informatik.tu-darmstadt.de>.
On 01.02.2013 at 00:33, Neal R Lewis <nr...@us.ibm.com> wrote:

> As for the identification of CASes and objects within, I would like to push the idea of a Feature Structure ID, as I've written about before.  Were there any other thoughts / suggestions about such an object?  

I think an ID for feature structures would be good for the CAS in general. Are you suggesting extending the UIMA FeatureStructure to contain an ID? 

-- Richard

-- 
------------------------------------------------------------------- 
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab (UKP-TUD) 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
eckart@ukp.informatik.tu-darmstadt.de 
www.ukp.tu-darmstadt.de 
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-------------------------------------------------------------------