Posted to dev@uima.apache.org by Marshall Schor <ms...@schor.com> on 2006/12/22 18:06:18 UTC

CAS Views and Sofas simplification

Using the definitions Adam defined:

* "CAS" means the entire CAS.  It never means a specific view of the CAS.
* "Index Definition" means the declaration in the descriptor that
defines an index - giving it a label, kind of index, CAS type, and
sort keys.
* "Index" is an instance of an index definition - something that can
be retrieved by a getIndex() call and from which you can get an
iterator.
* "Physical Index" is an actual data structure holding references to
FeatureStructures.  This  is transparent to the user but sometimes we
need to talk about it if we're concerned about performance.

To this, let me add:
* "Index Set" - a collection of Index definition instances - or Indexes 
(for short) -
identified by a name (called the "view name").

and:
* "Sofa" - a particular Subject of Analysis. 
  -  A CAS can hold many Sofas.
  -  Annotations (subtypes of AnnotationBase) are created having a ref 
to a particular Sofa

We can approach simplicity by identifying a small number of primitive things
that can be combined to give useful interpretations.

Consider:

0) CASes are the unit of work, the unit of remote data transfer, in 
UIMA. They often
correspond to a "document" (but for big docs, may only have part of it).
1) FS's are created in the (one-and-only) CAS.
2) Annotations can be created.
   -  If there is more than one "Sofa", you must specify which Sofa they 
are "over".
3) A magic method exists for tools to get all the FS's out of a CAS 
(when serializing).
   - This magic method can be restricted to just those FS's that are 
indexed in some index,
      or which is reachable from a chain of references starting in 
another FS which is indexed.
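
A minimal sketch of the restricted form of item 3, assuming a feature structure can be modeled as a node with outgoing references. SketchFS and ReachableFS are hypothetical names made up for illustration and are not part of the UIMA API. It collects every FS that is indexed or reachable from an indexed one:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a feature structure: a name plus outgoing references.
final class SketchFS {
    final String name;
    final List<SketchFS> refs = new ArrayList<>();

    SketchFS(String name) { this.name = name; }
}

final class ReachableFS {
    // Returns every FS that is indexed, or reachable from an indexed FS
    // by following references, in first-visited order.
    static List<SketchFS> collect(List<SketchFS> indexed) {
        // Identity-based set: two distinct FSs are never merged, whatever their content.
        Set<SketchFS> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        List<SketchFS> out = new ArrayList<>();
        Deque<SketchFS> work = new ArrayDeque<>(indexed);
        while (!work.isEmpty()) {
            SketchFS fs = work.removeFirst();
            if (!seen.add(fs)) continue;   // already visited; this also guards against cycles
            out.add(fs);
            work.addAll(fs.refs);
        }
        return out;
    }
}
```

An FS that is neither indexed nor referenced from an indexed FS never enters the work queue, so it is simply not returned.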

Can we stop there (here)?  I think with these concepts we can build the 
higher level
concepts we now have, efficiently, except for the concept of subsetting 
the FS's by "index-set".

Currently, we don't have a way to define an index which is a "filter" - 
including some members
of a type, while excluding others.  An abstract example: "odd-token" and 
"even-token" - both
being "token" types, but one only holding the "odd" ones, etc.  As Thilo 
has pointed out -
the index could contain all token types, and a "filtered-iterator" could 
be used at iteration time
to sort these out, as an alternative.  There are of course space/time 
tradeoffs here. 
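
To make the iteration-time alternative concrete, here is a rough sketch of a filtering iterator over a single index that holds all tokens. FilteredIterator and OddEvenTokens are hypothetical names for illustration only, and tokens are modeled as plain position numbers:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Wraps an iterator and skips every element the predicate rejects.
final class FilteredIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Predicate<T> keep;
    private T next;
    private boolean hasNext;

    FilteredIterator(Iterator<T> source, Predicate<T> keep) {
        this.source = source;
        this.keep = keep;
        advance();
    }

    // Moves to the next element that passes the filter, if there is one.
    private void advance() {
        hasNext = false;
        while (source.hasNext()) {
            T candidate = source.next();
            if (keep.test(candidate)) {
                next = candidate;
                hasNext = true;
                return;
            }
        }
    }

    @Override public boolean hasNext() { return hasNext; }

    @Override public T next() {
        if (!hasNext) throw new NoSuchElementException();
        T result = next;
        advance();
        return result;
    }
}

final class OddEvenTokens {
    // The index holds all tokens; the "odd-token" subset is computed while iterating.
    static List<Integer> oddTokens(List<Integer> allTokens) {
        List<Integer> out = new ArrayList<>();
        Iterator<Integer> it = new FilteredIterator<>(allTokens.iterator(), t -> t % 2 != 0);
        while (it.hasNext()) out.add(it.next());
        return out;
    }
}
```

The tradeoff shows up directly: no extra index is stored, but every iteration pays for testing the elements it skips.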

  - If we did have a way to define an index which is a "filter", we 
might be able to
     efficiently use this to do the same thing that index-sets enable, 
perhaps in a more general
     way.

  - Otherwise, we could use the concept of index-set to specify this filter:

4) FS's can be indexed (but don't have to be).
   -  If there is more than one "index set", you have to specify which 
"index set" to use;
      the index operations (add/remove) update only the indexes in that 
index-set.
      (Note that this doesn't fit with other ideas where a particular 
"index" might be in
       multiple index sets.  In this proposal, the only way to put an 
instance into multiple
       index sets is to do multiple adds, one per index-set.)
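
A minimal sketch of item 4, assuming an index set can be modeled as a named collection. IndexSetsSketch is a hypothetical class, not the UIMA index repository, and a single list stands in for all the indexes of one index set:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class IndexSetsSketch {
    // index-set name -> members of that index set
    private final Map<String, List<Object>> indexSets = new LinkedHashMap<>();

    // Adds the FS to the named index set only; other index sets are untouched.
    void add(String indexSetName, Object fs) {
        indexSets.computeIfAbsent(indexSetName, k -> new ArrayList<>()).add(fs);
    }

    // Removes the FS from the named index set only.
    boolean remove(String indexSetName, Object fs) {
        List<Object> members = indexSets.get(indexSetName);
        return members != null && members.remove(fs);
    }

    boolean contains(String indexSetName, Object fs) {
        List<Object> members = indexSets.get(indexSetName);
        return members != null && members.contains(fs);
    }
}
```

Putting one FS into two index sets then takes two add calls, one per index set, and removing it from one leaves the other alone.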

This doesn't have the concept of "global indexes".  If you want that, 
you can create another
"index set" and use it for that purpose.

This doesn't tie a Sofa to a View.  You could enforce some tie-in / 
restriction here if it was wanted.

This doesn't say that Annotations can only be indexed in a view which is 
(somehow) tied to
the same Sofa the Annotation is over.

The simplest solution in my mind (today :-) ) would scrap index-sets in 
favor of index-filtering.

-Marshall







Re: CAS Views and Sofas simplification

Posted by Marshall Schor <ms...@schor.com>.
Some clarifications below:

Thilo Goetz wrote:
> Marshall Schor wrote:
>> Using the definitions Adam defined:
>>
>> * "CAS" means the entire CAS.  It never means a specific view of the 
>> CAS.
>> * "Index Definition" means the declaration in the descriptor that
>> defines an index - giving it a label, kind of index, CAS type, and
>> sort keys.
>> * "Index" is an instance of an index definition - something that can
>> be retrieved by a getIndex() call and from which you can get an
>> iterator.
>> * "Physical Index" is an actual data structure holding references to
>> FeatureStructures.  This  is transparent to the user but sometimes we
>> need to talk about it if we're concerned about performance.
>>
>> To this, let me add:
>> * "Index Set" - a collection of Index definition instances - or 
>> Indexes (for short) -
>> identified by a name (called the "view name").
>
> I'm not sure this wasn't settled in your discussion with Adam, but to 
> my current way of thinking, a non-anchored view is nothing but a named 
> set of indexes.  So this definition of an index set seems redundant.
OK.  I was trying to keep the concepts simpler, more circumscribed:
     CAS - a container
          -  can have a Sofa, can have more than 1 Sofa
          -  has an Index Set, can have more than 1 Index Set

    Special case: "Anchored View":  An Index set that has 1 associated Sofa

With this formulation, you can see that there are other potential 
combinations:
    non-Anchored view: An Index Set without an associated Sofa
    multi-Anchored view (?? I made up this name, not really suggesting 
it ??): An Index set with more than 1 associated Sofa.
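
The three combinations above can be written down as a tiny sketch, assuming a view is characterized only by how many Sofas are associated with its index set. ViewKinds and its enum are hypothetical names, and "multi-anchored" is the made-up term from above, not an existing UIMA concept:

```java
import java.util.List;

final class ViewKinds {
    enum Kind { NON_ANCHORED, ANCHORED, MULTI_ANCHORED }

    // Classifies an index set by the number of Sofas associated with it.
    static Kind classify(List<String> associatedSofaNames) {
        switch (associatedSofaNames.size()) {
            case 0:  return Kind.NON_ANCHORED;
            case 1:  return Kind.ANCHORED;
            default: return Kind.MULTI_ANCHORED;
        }
    }
}
```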
>
> <snip>
>> 3) A magic method exists for tools to get all the FS's out of a CAS 
>> (when serializing).
>>   - This magic method can be restricted to just those FS's that are 
>> indexed in some index,
>>      or which is reachable from a chain of references starting in 
>> another FS which is indexed.
>
> I'm not quite sure what you mean here, but if this implies that this 
> magic method can also return FSs that are not indexed anywhere, I 
> don't think so.
That's OK - which is why I said the 2nd sentence.
> FSs that are not indexed are meant to be temporary and local to an 
> annotator, so no need to serialize them or do anything else with them.
Data that is *local* to an annotator is most likely never put into the 
CAS, but would rather be held in "native" annotator data structures, 
because (a) it's usually more efficient, and (b) the space is reclaimed 
(of course, depending on the Annotator design, and assuming we aren't 
adding garbage-collection to the CAS).

It seems to me a more convincing use-case for this is data that was put 
into the CAS (to be shared) which some
subsequent process effectively "deleted" (e.g., changed a reference that 
was serving to locate the FS).
>
>>
>> Can we stop there (here)?  I think with these concepts we can build 
>> the higher level
>> concepts we now have, efficiently, except for the concept of 
>> subsetting the FS's by "index-set".
>>
>> Currently, we don't have a way to define an index which is a "filter" 
>> - including some members
>> of a type, while excluding others.  An abstract example: "odd-token" 
>> and "even-token" - both
>> being "token" types, but one only holding the "odd" ones, etc.  As 
>> Thilo has pointed out -
>> the index could contain all token types, and a "filtered-iterator" 
>> could be used at iteration time
>> to sort these out, as an alternative.  There are of course space/time 
>> tradeoffs here.
>>  - If we did have a way to define an index which is a "filter", we 
>> might be able to
>>     efficiently use this to do the same thing that index-sets enable, 
>> perhaps in a more general
>>     way.
>
> More general how?
Multiple Index sets are an additional mechanism (compared with not 
having multiple index sets).  They provide
a way to say a FS is a "member" of some index set. 

A more general approach (one which doesn't add any new mechanisms to 
having a single Index Set) would be to
get rid of multiple index sets, and say if users want to make FSs 
"members" of some user-defined "sets" (called
views), they can do that using normal indexes (assuming we have added 
the ability to define indexes which
filter).   Here's how it could work:
   - Simple case: User desires FSs to be members of 1 view:
       - User defines additional structures to support the way they want 
to refer to these.
          - e.g. an additional slot per FS, with a ref to a "view" 
object they define.
          - User defines an index over the type they want in the view, 
with the extra predicate that
             slot  value == the view object
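
A rough sketch of that simple case, assuming a filtering index could be defined at all (it cannot today). ViewTaggedFS and FilteringIndex are hypothetical names. The filter is applied at indexing time, and the predicate is the "slot value == the view object" test:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical FS with the additional slot that refers to a user-defined "view" object.
final class ViewTaggedFS {
    final String type;
    final Object view;

    ViewTaggedFS(String type, Object view) {
        this.type = type;
        this.view = view;
    }
}

// An index over one type that stores only the FSs whose view slot is the given view object.
final class FilteringIndex {
    private final Predicate<ViewTaggedFS> accepts;
    private final List<ViewTaggedFS> members = new ArrayList<>();

    FilteringIndex(String type, Object view) {
        // Identity comparison on the slot: slot value == the view object.
        this.accepts = fs -> fs.type.equals(type) && fs.view == view;
    }

    // Returns true if the FS passed the filter and was stored, false if it was rejected.
    boolean add(ViewTaggedFS fs) {
        return accepts.test(fs) && members.add(fs);
    }

    int size() { return members.size(); }
}
```

The extra slot per FS and the need to hand the same view object to both the FS and the index definition are visible in the constructor arguments.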

This may not be such a good idea because it requires additional slot / 
FS, and it requires some "management" of
view names to be able to specify the equal test.

So - I think I would come down in favor of having multiple named index 
sets as a better approach here.
Just think of this thread as an intellectual exploration of 
possibilities, not as something I'm advocating :-)

>
>
>>
>>  - Otherwise, we could use the concept of index-set to specify this 
>> filter:
>>
>> 4) FS's can be indexed (but don't have to be).
>>   -  If there is more than one "index set", you have to specify which 
>> "index set" to use;
>>      the index operations (add/remove) update only the indexes in 
>> that index-set.
>>      (Note that this doesn't fit with other ideas where a particular 
>> "index" might be in
>>       multiple index sets.  In this proposal, the only way to put an 
>> instance into multiple
>>       index sets is to do multiple adds, one per index-set.)
>
> That's what we have views for, isn't it?
Right.  I think these are equivalent.  A view corresponds to a (named) 
index-set.
>
>>
>> This doesn't have the concept of "global indexes".  If you want that, 
>> you can create another
>> "index set" and use it for that purpose.
> <snip>
>
> The set of all indexes must be accessible to the user in the CAS, 
> otherwise we violate the "all data must be accessible from the CAS 
> without recourse to views" constraint.
I think this is satisfied by my so-called "magic method".  This would 
likely be implemented as suggested above. 
Most *typical users* probably would not use APIs which iterate through 
all indexes, but framework & tooling would. 

-Marshall


Re: CAS Views and Sofas simplification

Posted by Thilo Goetz <tw...@gmx.de>.
Marshall Schor wrote:
> Using the definitions Adam defined:
> 
> * "CAS" means the entire CAS.  It never means a specific view of the CAS.
> * "Index Definition" means the declaration in the descriptor that
> defines an index - giving it a label, kind of index, CAS type, and
> sort keys.
> * "Index" is an instance of an index definition - something that can
> be retrieved by a getIndex() call and from which you can get an
> iterator.
> * "Physical Index" is an actual data structure holding references to
> FeatureStructures.  This  is transparent to the user but sometimes we
> need to talk about it if we're concerned about performance.
> 
> To this, let me add:
> * "Index Set" - a collection of Index definition instances - or Indexes 
> (for short) -
> identified by a name (called the "view name").

I'm not sure this wasn't settled in your discussion with Adam, but to my 
current way of thinking, a non-anchored view is nothing but a named set 
of indexes.  So this definition of an index set seems redundant.

<snip>
> 3) A magic method exists for tools to get all the FS's out of a CAS 
> (when serializing).
>   - This magic method can be restricted to just those FS's that are 
> indexed in some index,
>      or which is reachable from a chain of references starting in 
> another FS which is indexed.

I'm not quite sure what you mean here, but if this implies that this 
magic method can also return FSs that are not indexed anywhere, I don't 
think so.  FSs that are not indexed are meant to be temporary and local 
to an annotator, so no need to serialize them or do anything else with them.

> 
> Can we stop there (here)?  I think with these concepts we can build the 
> higher level
> concepts we now have, efficiently, except for the concept of subsetting 
> the FS's by "index-set".
> 
> Currently, we don't have a way to define an index which is a "filter" - 
> including some members
> of a type, while excluding others.  An abstract example: "odd-token" and 
> "even-token" - both
> being "token" types, but one only holding the "odd" ones, etc.  As Thilo 
> has pointed out -
> the index could contain all token types, and a "filtered-iterator" could 
> be used at iteration time
> to sort these out, as an alternative.  There are of course space/time 
> tradeoffs here.
>  - If we did have a way to define an index which is a "filter", we might 
> be able to
>     efficiently use this to do the same thing that index-sets enable, 
> perhaps in a more general
>     way.

More general how?

> 
>  - Otherwise, we could use the concept of index-set to specify this filter:
> 
> 4) FS's can be indexed (but don't have to be).
>   -  If there is more than one "index set", you have to specify which 
> "index set" to use;
>      the index operations (add/remove) update only the indexes in that 
> index-set.
>      (Note that this doesn't fit with other ideas where a particular 
> "index" might be in
>       multiple index sets.  In this proposal, the only way to put an 
> instance into multiple
>       index sets is to do multiple adds, one per index-set.)

That's what we have views for, isn't it?

> 
> This doesn't have the concept of "global indexes".  If you want that, 
> you can create another
> "index set" and use it for that purpose.
<snip>

The set of all indexes must be accessible to the user in the CAS, 
otherwise we violate the "all data must be accessible from the CAS 
without recourse to views" constraint.

--Thilo




Re: CAS Views and Sofas simplification

Posted by Adam Lally <al...@alum.rpi.edu>.
More later (time to take stock of where we are again, I think...) but first:

On 12/27/06, Thilo Goetz <tw...@gmx.de> wrote:
> Adam Lally wrote:
> > On 12/22/06, Marshall Schor <ms...@schor.com> wrote:
> <snip>
> >> > Also, we have some uses of non-annotation indexes that are segregated
> >> > by Sofa (say, a Lemma index that's particular to a Sofa, where there's
> >> > actually no explicit link from the Lemma to the Sofa).  A filtering
> >> > approach wouldn't work there,
> >> It could be made to work by adding a feature to the Lemma type which was
> >> a sofa reference.  But maybe that's asking too much of the user?
> >
> > I'm not sure what is right here... this is a reasonable idea.  But I
> > think in the absence of a clear sense of what is best I lean towards
> > staying closer to where we currently are, which is to have views
> > where the user explicitly decides which view to index things in.
>
> The whole point of those views, I thought, was to be able to segregate
> the data.  So if you want lemmas for a certain view to be separate from
> the lemmas for different views, you should be able to achieve that with
> a lemma index that is specific to that view.

So you're agreeing with me, I think. (I'm the "> >" and the "> >> >" :)


> If you want to share
> lemmas from two views, share the index between the views.  That's my
> mental model of how things should work.  I like this better than adding
> sofa references for the following reasons:
>
> a) more space efficient, as there are no extra sofa references
> b) more time efficient, as you don't need to check the sofa references
> at indexing time
> c) no more complicated, as the user needs to reference something, the
> view or the sofa.
>
> This is how I would have done annotations as well.  Maybe there are
> considerations that I'm not aware of, but I see no benefit to each
> annotation knowing what sofa it references.

Well, I think the main reason we did this is so that we could
implement Annotation.getCoveredText().

Also we have the use case where we're doing translation and we have
Annotations in the translated text that point back to corresponding
Annotations in the original text.  So if you're walking the Annotation
index in the translated text and follow references that get you to
another Annotation, how are you supposed to know which Sofa the
Annotation you're looking at is supposed to be annotating?
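
A small sketch of both points, using hypothetical SofaSketch and AnnotationSketch classes rather than the real UIMA types. With an explicit sofa reference, the covered text can be computed from the annotation alone, including for an annotation reached by following a reference out of another Sofa's index:

```java
// Hypothetical stand-in for a Sofa that holds text.
final class SofaSketch {
    final String text;

    SofaSketch(String text) { this.text = text; }
}

// Hypothetical annotation that carries an explicit reference to its Sofa.
final class AnnotationSketch {
    final SofaSketch sofa;          // the text this annotation is over
    final int begin;
    final int end;
    final AnnotationSketch source;  // corresponding annotation in another Sofa, or null

    AnnotationSketch(SofaSketch sofa, int begin, int end, AnnotationSketch source) {
        this.sofa = sofa;
        this.begin = begin;
        this.end = end;
        this.source = source;
    }

    // Needs nothing but the annotation itself, because the Sofa is reachable from it.
    String getCoveredText() {
        return sofa.text.substring(begin, end);
    }
}
```

Without the sofa field, the caller would have to know which view it came through, and that information is lost as soon as a reference such as source leads into a different Sofa.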

To me, just looking at this from a data modeling perspective, the
purpose of an Annotation is to indicate some span of text, so it makes
sense to model it with a reference to that text.  But I suppose other
interpretations are possible.

> Of course that would make a view-less approach from the global CAS that
> much harder...
>

Impossible, I think.  We need to answer the question: are views a
fundamental way of interacting with the CAS (*any* CAS implementation
now or in the future, including raw XML manipulation) or not?

The UIMA spec proposal says not, and there's at least one vocal
proponent of that approach (Dan Gruhl).  We of course could decide to
become vocal proponents of the other approach, but it's not just
ourselves we need to convince.

-Adam

Re: CAS Views and Sofas simplification

Posted by Thilo Goetz <tw...@gmx.de>.
Adam Lally wrote:
> On 12/22/06, Marshall Schor <ms...@schor.com> wrote:
<snip>
>> > Also, we have some uses of non-annotation indexes that are segregated
>> > by Sofa (say, a Lemma index that's particular to a Sofa, where there's
>> > actually no explicit link from the Lemma to the Sofa).  A filtering
>> > approach wouldn't work there,
>> It could be made to work by adding a feature to the Lemma type which was
>> a sofa reference.  But maybe that's asking too much of the user?
> 
> I'm not sure what is right here... this is a reasonable idea.  But I
> think in the absence of a clear sense of what is best I lean towards
> staying closer to where we currently are, which is to have views
> where the user explicitly decides which view to index things in.

The whole point of those views, I thought, was to be able to segregate 
the data.  So if you want lemmas for a certain view to be separate from 
the lemmas for different views, you should be able to achieve that with 
a lemma index that is specific to that view.  If you want to share 
lemmas from two views, share the index between the views.  That's my 
mental model of how things should work.  I like this better than adding 
sofa references for the following reasons:

a) more space efficient, as there are no extra sofa references
b) more time efficient, as you don't need to check the sofa references 
at indexing time
c) no more complicated, as the user needs to reference something, the 
view or the sofa.

This is how I would have done annotations as well.  Maybe there are 
considerations that I'm not aware of, but I see no benefit to each 
annotation knowing what sofa it references.  If the annotation is 
indexed in a certain anchored view, the sofa of that view is what it 
references.  I understand that there may be data structures that need to 
reference sofas explicitly, but I don't think annotations qualify, nor 
do lemmas.

Of course that would make a view-less approach from the global CAS that 
much harder...

> 
> 
>> > So basically, is this equivalent to taking our current implementation
>> > of View and saying that the sofa is optional? (Which is more or less
>> > what the UIMA spec says.)
>> Well, it allows 2 or more Sofas to be indexed using a single
>> index-set (i.e., in a single view), which
>> the current design doesn't.
> 
> My idea of how to do this would be to create a View without any Sofa
> (a non-anchored view), and then you could add any annotations that you
> want to it.  There's no restriction on adding annotations to a
> non-anchored view, the only restriction that we might have would be on
> adding annotations to the "wrong" anchored view.

The user should be responsible for adding annotations to the correct 
view.  When dealing with annotations in anchored views, the user should 
not have to worry about sofas at all, and neither should the index.  As 
annotations need to be created on the view, it seems only natural that 
they will be indexed on the same view.  If somebody insists on indexing 
an annotation on the wrong view, that is their problem.  You can also 
create annotations with the wrong sofa, and there is no checking that 
will prevent that.

--Thilo




Re: CAS Views and Sofas simplification

Posted by Adam Lally <al...@alum.rpi.edu>.
On 12/22/06, Marshall Schor <ms...@schor.com> wrote:
> > Also why did you say "(when serializing)" - is it intended that this
> > operation not be used for other purposes such as by an annotator?
> This was me thinking that the main use-case for this is serialization, and
> remembering you wanted to hide this from users because they might abuse it?

Actually I suggested that making this operation available to users was
a way of enabling access to FS in the CAS without knowledge of views,
which I liked.  The only operation I remember wanting to hide/disable
was something that added an FS to all views.


> What I was trying to say was that there might be many annotation
> indexes.  Each one might have a
> "filter" saying that it should have annotations whose "sofa" was a
> particular sofa, for example.
>
OK, I see, that seems better.


> > Also, we have some uses of non-annotation indexes that are segregated
> > by Sofa (say, a Lemma index that's particular to a Sofa, where there's
> > actually no explicit link from the Lemma to the Sofa).  A filtering
> > approach wouldn't work there,
> It could be made to work by adding a feature to the Lemma type which was
> a sofa reference.  But maybe that's asking too much of the user?

I'm not sure what is right here... this is a reasonable idea.  But I
think in the absence of a clear sense of what is best I lean towards
staying closer to where we currently are, which is to have views
where the user explicitly decides which view to index things in.


> > So basically, is this equivalent to taking our current implementation
> > of View and saying that the sofa is optional? (Which is more or less
> > what the UIMA spec says.)
> Well, it allows 2 or more Sofas to be indexed using a single
> index-set (i.e., in a single view), which
> the current design doesn't.

My idea of how to do this would be to create a View without any Sofa
(a non-anchored view), and then you could add any annotations that you
want to it.  There's no restriction on adding annotations to a
non-anchored view, the only restriction that we might have would be on
adding annotations to the "wrong" anchored view.

-Adam

Re: CAS Views and Sofas simplification

Posted by Marshall Schor <ms...@schor.com>.
Adam Lally wrote:
> On 12/22/06, Marshall Schor <ms...@schor.com> wrote:
>> 0) CASes are the unit of work, the unit of remote data transfer, in
>> UIMA. They often
>> correspond to a "document" (but for big docs, may only have part of it).
>> 1) FS's are created in the (one-and-only) CAS.
>> 2) Annotations can be created.
>>    -  If there is more than one "Sofa", you must specify which Sofa they
>> are "over".
>> 3) A magic method exists for tools to get all the FS's out of a CAS
>> (when serializing).
>>    - This magic method can be restricted to just those FS's that are
>> indexed in some index,
>>       or which is reachable from a chain of references starting in
>> another FS which is indexed.
>>
>> Can we stop there (here)?  I think with these concepts we can build the
>> higher level
>> concepts we now have, efficiently, except for the concept of subsetting
>> the FS's by "index-set".
>>
>
> Hmmm.. don't you need the "FS can be indexed (but don't have to be)"
> part in here?  You refer to FS being indexed but don't say how they
> get there.
Right - I think I forgot to include at least one index-set.
>
> Also why did you say "(when serializing)" - is it intended that this
> operation not be used for other purposes such as by an annotator?
This was me thinking that the main use-case for this is serialization, and
remembering you wanted to hide this from users because they might abuse it?
>> Currently, we don't have a way to define an index which is a "filter" -
>> including some members
>> of a type, while excluding others.  An abstract example: "odd-token" and
>> "even-token" - both
>> being "token" types, but one only holding the "odd" ones, etc.  As Thilo
>> has pointed out -
>> the index could contain all token types, and a "filtered-iterator" could
>> be used at iteration time
>> to sort these out, as an alternative.  There are of course space/time
>> tradeoffs here.
>>
>>   - If we did have a way to define an index which is a "filter", we
>> might be able to
>>      efficiently use this to do the same thing that index-sets enable,
>> perhaps in a more general
>>      way.
>>
>
> So let me see if I have this right - there would be just one
> annotation index, sorted on begin, end.  
What I was trying to say was that there might be many annotation 
indexes.  Each one might have a
"filter" saying that it should have annotations whose "sofa" was a 
particular sofa, for example.

> All indexed annotations for
> any Sofa in the CAS would exist in this one index.  If an annotator
> wanted to do the usual operation of iterating over annotations
> relating to one particular Sofa, this would be done using a filtered
> iterator that would filter out any annotations not referring to the
> specified Sofa.  Correct?
See above...  Could be done this way, but I was thinking that the 
filtering would
be done at indexing time, not at iteration time.
>
> One thing that comes to mind is that it may be more efficient to keep
> the annotation index segregated by Sofa as we do today.  That's
> because I presume no one will actually care about the relative
> ordering of annotations from different Sofas, so we'd be wasting time
> if we computed it.  
Right - why I was thinking it would be done at indexing time.
> And, we currently benefit from the fact that
> annotations are usually created in order, but we'd lose that benefit
> if we had an index that interleaved annotations across Sofas.
>
> Also, we have some uses of non-annotation indexes that are segregated
> by Sofa (say, a Lemma index that's particular to a Sofa, where there's
> actually no explicit link from the Lemma to the Sofa).  A filtering
> approach wouldn't work there, 
It could be made to work by adding a feature to the Lemma type which was
a sofa reference.  But maybe that's asking too much of the user?
> although perhaps we can argue that those
> cases are poor design.
>
>>   - Otherwise, we could use the concept of index-set to specify this 
>> filter:
>>
>> 4) FS's can be indexed (but don't have to be).
>>    -  If there is more than one "index set", you have to specify which
>> "index set" to use;
>>       the index operations (add/remove) update only the indexes in that
>> index-set.
>>       (Note that this doesn't fit with other ideas where a particular
>> "index" might be in
>>        multiple index sets.  In this proposal, the only way to put an
>> instance into multiple
>>        index sets is to do multiple adds, one per index-set.)
>>
>> This doesn't have the concept of "global indexes".  If you want that,
>> you can create another
>> "index set" and use it for that purpose.
>>
>> This doesn't tie a Sofa to a View.  You could enforce some tie-in /
>> restriction here if it was wanted.
>>
>
> So basically, is this equivalent to taking our current implementation
> of View and saying that the sofa is optional? (Which is more or less
> what the UIMA spec says.)
Well, it allows 2 or more Sofas to be indexed using a single
index-set (i.e., in a single view), which
the current design doesn't.
>> This doesn't say that Annotations can only be indexed in a view which is
>> (somehow) tied to
>> the same Sofa the Annotation is over.
>>
>
> To clarify on the anchored view constraint that the UIMA Spec talks
> about - it is not quite what you say here.  You can add any Annotation
> to a non-anchored view (one that has no sofa).  You just cannot add an
> Annotation to a view that's anchored to a _different_ sofa than the
> Sofa that the Annotation points to.
> Whether this constraint is checked or not is up to the implementation.
> So I think this just comes down to how we feel about the performance
> of the check.  But it's still important to understand the concept -
> the intention of the anchored view is to segregate things by Sofa, and
> it's valid for downstream annotators to rely on this.  A framework may
> not check, but an annotator that violates the anchored view constraint
> is a badly behaved annotator.
-Marshall

Re: CAS Views and Sofas simplification

Posted by Adam Lally <al...@alum.rpi.edu>.
On 12/22/06, Marshall Schor <ms...@schor.com> wrote:
> 0) CASes are the unit of work, the unit of remote data transfer, in
> UIMA. They often
> correspond to a "document" (but for big docs, may only have part of it).
> 1) FS's are created in the (one-and-only) CAS.
> 2) Annotations can be created.
>    -  If there is more than one "Sofa", you must specify which Sofa they
> are "over".
> 3) A magic method exists for tools to get all the FS's out of a CAS
> (when serializing).
>    - This magic method can be restricted to just those FS's that are
> indexed in some index,
>       or which is reachable from a chain of references starting in
> another FS which is indexed.
>
> Can we stop there (here)?  I think with these concepts we can build the
> higher level
> concepts we now have, efficiently, except for the concept of subsetting
> the FS's by "index-set".
>

Hmmm.. don't you need the "FS can be indexed (but don't have to be)"
part in here?  You refer to FS being indexed but don't say how they
get there.

Also why did you say "(when serializing)" - is it intended that this
operation not be used for other purposes such as by an annotator?


> Currently, we don't have a way to define an index which is a "filter" -
> including some members
> of a type, while excluding others.  An abstract example: "odd-token" and
> "even-token" - both
> being "token" types, but one only holding the "odd" ones, etc.  As Thilo
> has pointed out -
> the index could contain all token types, and a "filtered-iterator" could
> be used at iteration time
> to sort these out, as an alternative.  There are of course space/time
> tradeoffs here.
>
>   - If we did have a way to define an index which is a "filter", we
> might be able to
>      efficiently use this to do the same thing that index-sets enable,
> perhaps in a more general
>      way.
>

So let me see if I have this right - there would be just one
annotation index, sorted on begin, end.  All indexed annotations for
any Sofa in the CAS would exist in this one index.  If an annotator
wanted to do the usual operation of iterating over annotations
relating to one particular Sofa, this would be done using a filtered
iterator that would filter out any annotations not referring to the
specified Sofa.  Correct?

One thing that comes to mind is that it may be more efficient to keep
the annotation index segregated by Sofa as we do today.  That's
because I presume no one will actually care about the relative
ordering of annotations from different Sofas, so we'd be wasting time
if we computed it.  And, we currently benefit from the fact that
annotations are usually created in order, but we'd lose that benefit
if we had an index that interleaved annotations across Sofas.

Also, we have some uses of non-annotation indexes that are segregated
by Sofa (say, a Lemma index that's particular to a Sofa, where there's
actually no explicit link from the Lemma to the Sofa).  A filtering
approach wouldn't work there, although perhaps we can argue that those
cases are poor design.


>   - Otherwise, we could use the concept of index-set to specify this filter:
>
> 4) FS's can be indexed (but don't have to be).
>    -  If there is more than one "index set", you have to specify which
> "index set" to use;
>       the index operations (add/remove) update only the indexes in that
> index-set.
>       (Note that this doesn't fit with other ideas where a particular
> "index" might be in
>        multiple index sets.  In this proposal, the only way to put an
> instance into multiple
>        index sets is to do multiple adds, one per index-set.)
>
> This doesn't have the concept of "global indexes".  If you want that,
> you can create another
> "index set" and use it for that purpose.
>
> This doesn't tie a Sofa to a View.  You could enforce some tie-in /
> restriction here if it was wanted.
>

So basically, is this equivalent to taking our current implementation
of View and saying that the sofa is optional? (Which is more or less
what the UIMA spec says.)


> This doesn't say that Annotations can only be indexed in a view which is
> (somehow) tied to
> the same Sofa the Annotation is over.
>

To clarify on the anchored view constraint that the UIMA Spec talks
about - it is not quite what you say here.  You can add any Annotation
to a non-anchored view (one that has no sofa).  You just cannot add an
Annotation to a view that's anchored to a _different_ sofa than the
Sofa that the Annotation points to.

Whether this constraint is checked or not is up to the implementation.
So I think this just comes down to how we feel about the performance
of the check.  But it's still important to understand the concept -
the intention of the anchored view is to segregate things by Sofa, and
it's valid for downstream annotators to rely on this.  A framework may
not check, but an annotator that violates the anchored view constraint
is a badly behaved annotator.
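
A minimal sketch of the constraint, with the check made optional to reflect that enforcement is up to the implementation. CheckedView is a hypothetical class, Sofas are modeled by name only, and a null anchor stands for a non-anchored view:

```java
import java.util.ArrayList;
import java.util.List;

final class CheckedView {
    private final String anchoredSofa;      // null models a non-anchored view
    private final boolean checkConstraint;  // whether this "framework" enforces the constraint
    private final List<String> indexedAnnotationSofas = new ArrayList<>();

    CheckedView(String anchoredSofa, boolean checkConstraint) {
        this.anchoredSofa = anchoredSofa;
        this.checkConstraint = checkConstraint;
    }

    // Indexes an annotation, identified here only by the Sofa it points to.
    void addAnnotationOver(String annotationSofa) {
        boolean violates = anchoredSofa != null && !anchoredSofa.equals(annotationSofa);
        if (checkConstraint && violates) {
            throw new IllegalArgumentException(
                "view is anchored to " + anchoredSofa + ", annotation is over " + annotationSofa);
        }
        indexedAnnotationSofas.add(annotationSofa);
    }

    int size() { return indexedAnnotationSofas.size(); }
}
```

The cost of the check in this sketch is one comparison per add. With checking switched off the violating add goes through silently, which is exactly the case where downstream annotators relying on the segregation would be misled.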


-Adam