Posted to dev@uima.apache.org by Marshall Schor <ms...@schor.com> on 2016/09/01 14:09:30 UTC

opinion on degree of backwards compatibility for Uima V3 experiment

The UIMA V3 implementation includes, in many places, extra code (costing time
and space) whose goal is to make things look closer to version 2.  Some of this
is for interoperability with version 2 artifacts, like serialized forms.

An example: in v2, many serialization forms include "references" to other
Feature Structures (FSs), and for those, the encoding is the "address" in the
heap of the FS.

In v3, there is no heap, but the FSs have "ids", which are (at the moment) an
int which increments by 1.  These do not match the "addresses" in v2, so many
parts of the serialization code build a map at serialization time from the v3
ids to v2 "addresses", and use the latter in the serialized form.
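
A minimal sketch of that mapping step, for illustration only. It is not the
actual UIMA serialization code: the class name IdToAddressSketch, the method
buildIdToAddress, and the per-FS slot counts are all hypothetical, and the
choice to start addresses at 1 is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: derive v2-style heap addresses for FSs that carry sequential v3 ids. */
final class IdToAddressSketch {

  /**
   * @param heapSizes heapSizes[i] is the number of heap slots the FS with
   *                  id (i + 1) would occupy in a v2-style heap (hypothetical input)
   * @return map from v3 id (1, 2, 3, ...) to the simulated v2 address
   */
  static Map<Integer, Integer> buildIdToAddress(int[] heapSizes) {
    Map<Integer, Integer> idToAddress = new LinkedHashMap<>();
    int nextAddress = 1; // assumption: slot 0 is reserved, so the first FS starts at 1
    for (int i = 0; i < heapSizes.length; i++) {
      idToAddress.put(i + 1, nextAddress); // ids increment by 1
      nextAddress += heapSizes[i];         // addresses advance by the FS size
    }
    return idToAddress;
  }
}
```

With slot counts {3, 5, 2}, ids 1, 2, 3 map to addresses 1, 4, 9, which shows
why the two numbering schemes diverge as soon as an FS occupies more than one
slot.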

Currently, this is done for various binary serializations, so that these can be
read back in by v2 code.

Currently, it's not done for JSON or XMI (and maybe XCAS - haven't checked).  So
the serialized forms for these differ between v2 and v3, in that the numbers
used to represent references to other FSs are different.

The deserialization code for XMI and JSON doesn't depend on these numbers being
anything other than unique per FS, so there's no issue in deserializing.  But
the UIMA community may have built other things that depend on these identifiers
not changing. 

What's your opinion: should the XMI, JSON, etc. serializations in V3 be changed
to reproduce (approximately) the same reference numbers as v2?  I say
approximately, because other factors might affect these, such as the ordering
of things not in "ordered" indexes, which can differ between v2 and v3.

-Marshall

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Marshall Schor <ms...@schor.com>.

whew! -M


On 9/2/2016 9:27 AM, Peter Klügl wrote:
> Tested all formats, did not happen for a reasonable complex CAS.
>
>
> Am 02.09.2016 um 15:26 schrieb Marshall Schor:
>> Re: deserializing the same CAS twice shouldn't change the addresses;  if you
>> have a case where it's doing that, I'll investigate (need a small test case...).
>>
>> -Marshall
>>
>> On 9/2/2016 5:36 AM, Peter Klügl wrote:
>>> Same here.
>>>
>>>
>>> It looks like we are now also starting to use the address, and I am
>>> also thinking of using it more in Ruta (internal indexing).
>>>
>>>
>>> Btw, I did some simple experiments lately concerning the stability of
>>> the addresses when using CasIOUtils. Can it happen that the addresses
>>> change if you just deserialize the same CAS twice without serializing it
>>> in between?
>>>
>>>
>>> Best,
>>>
>>>
>>> Peter
>>>
>>>
>>>
>>> Am 01.09.2016 um 19:29 schrieb Richard Eckart de Castilho:
>>>> FS IDs are IMHO a very useful thing. Providing out-of-band (i.e. out-of-type-system) unique identifiers for feature structures facilitates handling them in e.g. in editors. We use that quite a bit in WebAnno.
>>>>
>>>> In WebAnno, we do not rely on any heap arithmetics - an ID is just expected to be a unique identifier. However, I could imagine cases where people might rely on the ID to increment monotonically for new FSes.
>>>>
>>>> Most binary formats do not preserve the ID across a save/load cycle. However, SERIALIZED and SERIALIZED_TSI *do* preserve the ID, and WebAnno makes used of that. It allows to keep references to FSes without having to keep the CAS in memory all the time. 
>>>>
>>>> There should continue to be a V3 serialization format which preserves IDs across a load/save cycle. 
>>>>
>>>> I do presently not see a case where a strong similarity between V2 and V3 IDs would be important. It would be nice if deserializing a V2 SERIALIZED or SERIALIZED_TSI into V3 would restore the V2 IDs - I expect it to be an easy thing to do.
>>>>
>>>> Cheers,
>>>>
>>>> -- Richard
>>>>
>>>>> On 01.09.2016, at 16:09, Marshall Schor <ms...@schor.com> wrote:
>>>>>
>>>>> UIMA V3 implementation includes in many places extra code (takes time / space)
>>>>> whose goal is to make things look closer to version 2.  Some of this is for
>>>>> interoperability with version 2 artifacts, like serialized forms.
>>>>>
>>>>> An example: in v2, many serialization forms include "references" to other
>>>>> Feature Structures (FSs), and for those, the encoding is the "address" in the
>>>>> heap of the FS.
>>>>>
>>>>> In v3, there is no heap, but the FSs have "ids", which are (at the moment) an
>>>>> int which increments by 1.  This mis-matches the "address" in v2, so many parts
>>>>> of the serialization code builds a map at serialization time from the v3 id's to
>>>>> v2 "addresses", and uses the latter in the serialization form.
>>>>>
>>>>> Currently, this is done for various binary serializations, so that these can be
>>>>> read back in by v2 code.
>>>>>
>>>>> Currently, it's not done for JSON or XMI (and maybe XCAS - haven't checked).  So
>>>>> the serialized forms for these differ between v2 and v3, in that the numbers
>>>>> used to represent references to other FSs are different.
>>>>>
>>>>> The deserialization code for XMI and JSON doesn't depend on these numbers being
>>>>> anything other than unique per FS, so there's no issue in deserializing.  But
>>>>> the UIMA community may have built other things that depend on these identifiers
>>>>> not changing. 
>>>>>
>>>>> What's your opinion: should the XMI and JSON etc serialization in V3 be changed
>>>>> to reproduce (approximately) the same reference numbers as v2?  I say
>>>>> approximately, because other factors might affect these, such as the ordering
>>>>> for things not in "ordered" indexes, etc. between v2 and v3.
>>>>>
>>>>> -Marshall
>>>>>
>

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Peter Klügl <pe...@averbis.com>.

I tested all formats; it did not happen for a reasonably complex CAS.


Am 02.09.2016 um 15:26 schrieb Marshall Schor:
> Re: deserializing the same CAS twice shouldn't change the addresses;  if you
> have a case where it's doing that, I'll investigate (need a small test case...).
>
> -Marshall
>
> On 9/2/2016 5:36 AM, Peter Klügl wrote:
>> Same here.
>>
>>
>> It looks like we are now also starting to use the address, and I am
>> also thinking of using it more in Ruta (internal indexing).
>>
>>
>> Btw, I did some simple experiments lately concerning the stability of
>> the addresses when using CasIOUtils. Can it happen that the addresses
>> change if you just deserialize the same CAS twice without serializing it
>> in between?
>>
>>
>> Best,
>>
>>
>> Peter
>>
>>
>>
>> Am 01.09.2016 um 19:29 schrieb Richard Eckart de Castilho:
>>> FS IDs are IMHO a very useful thing. Providing out-of-band (i.e. out-of-type-system) unique identifiers for feature structures facilitates handling them in e.g. in editors. We use that quite a bit in WebAnno.
>>>
>>> In WebAnno, we do not rely on any heap arithmetics - an ID is just expected to be a unique identifier. However, I could imagine cases where people might rely on the ID to increment monotonically for new FSes.
>>>
>>> Most binary formats do not preserve the ID across a save/load cycle. However, SERIALIZED and SERIALIZED_TSI *do* preserve the ID, and WebAnno makes used of that. It allows to keep references to FSes without having to keep the CAS in memory all the time. 
>>>
>>> There should continue to be a V3 serialization format which preserves IDs across a load/save cycle. 
>>>
>>> I do presently not see a case where a strong similarity between V2 and V3 IDs would be important. It would be nice if deserializing a V2 SERIALIZED or SERIALIZED_TSI into V3 would restore the V2 IDs - I expect it to be an easy thing to do.
>>>
>>> Cheers,
>>>
>>> -- Richard
>>>
>>>> On 01.09.2016, at 16:09, Marshall Schor <ms...@schor.com> wrote:
>>>>
>>>> UIMA V3 implementation includes in many places extra code (takes time / space)
>>>> whose goal is to make things look closer to version 2.  Some of this is for
>>>> interoperability with version 2 artifacts, like serialized forms.
>>>>
>>>> An example: in v2, many serialization forms include "references" to other
>>>> Feature Structures (FSs), and for those, the encoding is the "address" in the
>>>> heap of the FS.
>>>>
>>>> In v3, there is no heap, but the FSs have "ids", which are (at the moment) an
>>>> int which increments by 1.  This mis-matches the "address" in v2, so many parts
>>>> of the serialization code builds a map at serialization time from the v3 id's to
>>>> v2 "addresses", and uses the latter in the serialization form.
>>>>
>>>> Currently, this is done for various binary serializations, so that these can be
>>>> read back in by v2 code.
>>>>
>>>> Currently, it's not done for JSON or XMI (and maybe XCAS - haven't checked).  So
>>>> the serialized forms for these differ between v2 and v3, in that the numbers
>>>> used to represent references to other FSs are different.
>>>>
>>>> The deserialization code for XMI and JSON doesn't depend on these numbers being
>>>> anything other than unique per FS, so there's no issue in deserializing.  But
>>>> the UIMA community may have built other things that depend on these identifiers
>>>> not changing. 
>>>>
>>>> What's your opinion: should the XMI and JSON etc serialization in V3 be changed
>>>> to reproduce (approximately) the same reference numbers as v2?  I say
>>>> approximately, because other factors might affect these, such as the ordering
>>>> for things not in "ordered" indexes, etc. between v2 and v3.
>>>>
>>>> -Marshall
>>>>

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Marshall Schor <ms...@schor.com>.

Re: deserializing the same CAS twice shouldn't change the addresses;  if you
have a case where it's doing that, I'll investigate (need a small test case...).

-Marshall

On 9/2/2016 5:36 AM, Peter Klügl wrote:
> Same here.
>
>
> It looks like we are now also starting to use the address, and I am
> also thinking of using it more in Ruta (internal indexing).
>
>
> Btw, I did some simple experiments lately concerning the stability of
> the addresses when using CasIOUtils. Can it happen that the addresses
> change if you just deserialize the same CAS twice without serializing it
> in between?
>
>
> Best,
>
>
> Peter
>
>
>
> Am 01.09.2016 um 19:29 schrieb Richard Eckart de Castilho:
>> FS IDs are IMHO a very useful thing. Providing out-of-band (i.e. out-of-type-system) unique identifiers for feature structures facilitates handling them, e.g. in editors. We use that quite a bit in WebAnno.
>>
>> In WebAnno, we do not rely on any heap arithmetic - an ID is just expected to be a unique identifier. However, I could imagine cases where people might rely on the ID to increment monotonically for new FSes.
>>
>> Most binary formats do not preserve the ID across a save/load cycle. However, SERIALIZED and SERIALIZED_TSI *do* preserve the ID, and WebAnno makes use of that. It allows us to keep references to FSes without having to keep the CAS in memory all the time.
>>
>> There should continue to be a V3 serialization format which preserves IDs across a load/save cycle. 
>>
>> I do presently not see a case where a strong similarity between V2 and V3 IDs would be important. It would be nice if deserializing a V2 SERIALIZED or SERIALIZED_TSI into V3 would restore the V2 IDs - I expect it to be an easy thing to do.
>>
>> Cheers,
>>
>> -- Richard
>>
>>> On 01.09.2016, at 16:09, Marshall Schor <ms...@schor.com> wrote:
>>>
>>> UIMA V3 implementation includes in many places extra code (takes time / space)
>>> whose goal is to make things look closer to version 2.  Some of this is for
>>> interoperability with version 2 artifacts, like serialized forms.
>>>
>>> An example: in v2, many serialization forms include "references" to other
>>> Feature Structures (FSs), and for those, the encoding is the "address" in the
>>> heap of the FS.
>>>
>>> In v3, there is no heap, but the FSs have "ids", which are (at the moment) an
>>> int which increments by 1.  This mis-matches the "address" in v2, so many parts
>>> of the serialization code builds a map at serialization time from the v3 id's to
>>> v2 "addresses", and uses the latter in the serialization form.
>>>
>>> Currently, this is done for various binary serializations, so that these can be
>>> read back in by v2 code.
>>>
>>> Currently, it's not done for JSON or XMI (and maybe XCAS - haven't checked).  So
>>> the serialized forms for these differ between v2 and v3, in that the numbers
>>> used to represent references to other FSs are different.
>>>
>>> The deserialization code for XMI and JSON doesn't depend on these numbers being
>>> anything other than unique per FS, so there's no issue in deserializing.  But
>>> the UIMA community may have built other things that depend on these identifiers
>>> not changing. 
>>>
>>> What's your opinion: should the XMI and JSON etc serialization in V3 be changed
>>> to reproduce (approximately) the same reference numbers as v2?  I say
>>> approximately, because other factors might affect these, such as the ordering
>>> for things not in "ordered" indexes, etc. between v2 and v3.
>>>
>>> -Marshall
>>>
>

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Marshall Schor <ms...@schor.com>.

Yes, good idea :-)  I'll change this in v3, so the id is more likely to
correspond to the v2 one.  I suspect the performance impact will be unnoticeable.

-Marshall
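
A rough sketch of the idea quoted below (ids that behave like offsets in a
heap), for illustration only. HeapOffsetIdSketch and assignId are
hypothetical names, not UIMA API, and the starting value of 1 and the
per-FS slot count are assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: hand out ids as if they were offsets into a v2-style heap. */
final class HeapOffsetIdSketch {

  // Assumption: ids start at 1 so that 0 can stand for "no FS".
  private final AtomicInteger nextId = new AtomicInteger(1);

  /**
   * @param slots number of heap slots this FS would have occupied in v2
   *              (hypothetical input)
   * @return the id for the new FS; ids are unique and strictly increasing
   */
  int assignId(int slots) {
    if (slots <= 0) {
      throw new IllegalArgumentException("slots must be positive");
    }
    // Return the current offset and advance it by the size of this FS.
    return nextId.getAndAdd(slots);
  }
}
```

Because every call advances the counter by a positive amount, the ids stay
unique and monotonically increasing, while the gaps between them mimic the
spacing of v2 addresses.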


On 9/2/2016 8:17 AM, Burn Lewis wrote:
> Could the id assigned in V3 be the same as the V2 address, as if the offset
> in a heap?  Unique and monotonically increasing.
>
> Burn
>
> On Fri, Sep 2, 2016 at 5:36 AM, Peter Klügl <pe...@averbis.com>
> wrote:
>
>> Same here.
>>
>>
>> It looks like we are now also starting to use the address, and I am
>> also thinking of using it more in Ruta (internal indexing).
>>
>>
>> Btw, I did some simple experiments lately concerning the stability of
>> the addresses when using CasIOUtils. Can it happen that the addresses
>> change if you just deserialize the same CAS twice without serializing it
>> in between?
>>
>>
>> Best,
>>
>>
>> Peter
>>
>>
>>
>> Am 01.09.2016 um 19:29 schrieb Richard Eckart de Castilho:
>>> FS IDs are IMHO a very useful thing. Providing out-of-band (i.e.
>> out-of-type-system) unique identifiers for feature structures facilitates
>> handling them in e.g. in editors. We use that quite a bit in WebAnno.
>>> In WebAnno, we do not rely on any heap arithmetics - an ID is just
>> expected to be a unique identifier. However, I could imagine cases where
>> people might rely on the ID to increment monotonically for new FSes.
>>> Most binary formats do not preserve the ID across a save/load cycle.
>> However, SERIALIZED and SERIALIZED_TSI *do* preserve the ID, and WebAnno
>> makes used of that. It allows to keep references to FSes without having to
>> keep the CAS in memory all the time.
>>> There should continue to be a V3 serialization format which preserves
>> IDs across a load/save cycle.
>>> I do presently not see a case where a strong similarity between V2 and
>> V3 IDs would be important. It would be nice if deserializing a V2
>> SERIALIZED or SERIALIZED_TSI into V3 would restore the V2 IDs - I expect it
>> to be an easy thing to do.
>>> Cheers,
>>>
>>> -- Richard
>>>
>>>> On 01.09.2016, at 16:09, Marshall Schor <ms...@schor.com> wrote:
>>>>
>>>> UIMA V3 implementation includes in many places extra code (takes time /
>> space)
>>>> whose goal is to make things look closer to version 2.  Some of this is
>> for
>>>> interoperability with version 2 artifacts, like serialized forms.
>>>>
>>>> An example: in v2, many serialization forms include "references" to
>> other
>>>> Feature Structures (FSs), and for those, the encoding is the "address"
>> in the
>>>> heap of the FS.
>>>>
>>>> In v3, there is no heap, but the FSs have "ids", which are (at the
>> moment) an
>>>> int which increments by 1.  This mis-matches the "address" in v2, so
>> many parts
>>>> of the serialization code builds a map at serialization time from the
>> v3 id's to
>>>> v2 "addresses", and uses the latter in the serialization form.
>>>>
>>>> Currently, this is done for various binary serializations, so that
>> these can be
>>>> read back in by v2 code.
>>>>
>>>> Currently, it's not done for JSON or XMI (and maybe XCAS - haven't
>> checked).  So
>>>> the serialized forms for these differ between v2 and v3, in that the
>> numbers
>>>> used to represent references to other FSs are different.
>>>>
>>>> The deserialization code for XMI and JSON doesn't depend on these
>> numbers being
>>>> anything other than unique per FS, so there's no issue in
>> deserializing.  But
>>>> the UIMA community may have built other things that depend on these
>> identifiers
>>>> not changing.
>>>>
>>>> What's your opinion: should the XMI and JSON etc serialization in V3 be
>> changed
>>>> to reproduce (approximately) the same reference numbers as v2?  I say
>>>> approximately, because other factors might affect these, such as the
>> ordering
>>>> for things not in "ordered" indexes, etc. between v2 and v3.
>>>>
>>>> -Marshall
>>>>
>>

Re: UIMA and Spark (was Re: opinion on degree of backwards compatibility for Uima V3 experiment)

Posted by Joern Kottmann <ko...@gmail.com>.

Today you can use UIMA AEs fully encapsulated in a Spark function without
much effort.

I think it would be nice if as a user I could just use the CAS with Spark
like I showed in the example. This is probably not very difficult to
achieve.

Jörn

On Sep 9, 2016 18:50, "Richard Eckart de Castilho" <re...@apache.org> wrote:

On 09.09.2016, at 11:38, Joern Kottmann <ko...@gmail.com> wrote:
>
> I think the best way to answer this question is to write a few fully working
> simple examples which use UIMA 2 and different Hadoop frameworks,
> e.g. MapReduce, Spark, etc. and see how we can make it a pleasure to use
> UIMA with those.

Some time back, Philip Ogren has produced some nice high-level slides on
UIMA + Spark:

https://spark-summit.org/2014/wp-content/uploads/2014/07/Leveraging-UIMA-in-Spark-Philip-Ogren.pdf

Cheers,

-- Richard

UIMA and Spark (was Re: opinion on degree of backwards compatibility for Uima V3 experiment)

Posted by Richard Eckart de Castilho <re...@apache.org>.

On 09.09.2016, at 11:38, Joern Kottmann <ko...@gmail.com> wrote:
> 
> I think the best way to answer this question is to write a few fully working
> simple examples which use UIMA 2 and different Hadoop frameworks,
> e.g. MapReduce, Spark, etc. and see how we can make it a pleasure to use
> UIMA with those.

Some time back, Philip Ogren produced some nice high-level slides on UIMA + Spark:

https://spark-summit.org/2014/wp-content/uploads/2014/07/Leveraging-UIMA-in-Spark-Philip-Ogren.pdf

Cheers,

-- Richard

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Joern Kottmann <ko...@gmail.com>.

I think the best way to answer this question is to write a few fully working
simple examples which use UIMA 2 and different Hadoop frameworks,
e.g. MapReduce, Spark, etc., and see how we can make it a pleasure to use
UIMA with those.

I sketched out some Spark code which shows how I would like to use UIMA.
But I think today things are much more complex, and some things are either
not possible or not fast (the CAS is not designed to be immutable).

  public static void main(String[] args) {

    // Shared type system (construction elided in this sketch).
    TypeSystem ts = ...

    JavaSparkContext sc = new JavaSparkContext();
    JavaRDD<String> texts = sc.textFile("hdfs://...");

    // One CAS per input text.
    JavaRDD<CAS> docs = texts.map(new Function<String, CAS>() {
      @Override
      public CAS call(String text) throws Exception {
        // create CAS from shared type system
        // add the text to the cas
        // return the cas
      }
    });

    AnalysisEngine tokenizer = ...

    // process(CAS) does not return the CAS, so return it explicitly.
    docs = docs.map(cas -> {
      tokenizer.process(cas);
      return cas;
    });

    // Placeholder: look up the token type from ts before running this.
    Type tokenType = null;
    JavaRDD<Integer> counts = docs.map(new Function<CAS, Integer>() {
      @Override
      public Integer call(CAS cas) throws Exception {
        int tokenCount = 0;
        for (FeatureStructure fs : cas.getAnnotationIndex(tokenType)) {
          tokenCount++;
        }
        return tokenCount;
      }
    });

    // Total number of tokens across all documents.
    Integer total = counts.reduce((a, b) -> a + b);
  }

Jörn




On Wed, Sep 7, 2016 at 3:45 PM, Marshall Schor <ms...@schor.com> wrote:

> Hi Jörn,
>
> Thanks for your input.  Could you possible expand with a few specifics on
> what
> changes you think would make it easier to use with Hadoop etc.?
>
> -Marshall
>
>
> On 9/7/2016 7:46 AM, Joern Kottmann wrote:
> > Hello all,
> >
> > at my work place we use UIMA mostly with custom code to load data into a
> > pipeline and store its results,
> > therefore we don't depend at all on the UIMA serialization formats. And
> > changing them, or adding new ones which
> > are incompatible wouldn't be an issue at all. Also the existing code can
> be
> > ported to work with UIMA 3.
> >
> > I really hope we can get UIMA 3 into a shape where it is easier to use
> with
> > todays requirements (e.g. with Hadoop)
> > and possibilities.
> >
> > I personally think that the effort to create the next overhauled version
> > shouldn't be limited in anyway by backward compatibility.
> > For me it is a good solution if there is some help with migrating things
> to
> > UIMA 3 (e.g. a guide which explains what to do)
> > and maybe maintaining UIMA 2 for a while in parallel (e.g. fixes of very
> > urgent/critical bugs).
> >
> > Jörn
> >
> > On Fri, Sep 2, 2016 at 7:56 PM, Richard Eckart de Castilho <
> rec@apache.org>
> > wrote:
> >
> >> See comment at end of mail.
> >>
> >> On 02.09.2016, at 15:18, Marshall Schor <ms...@schor.com> wrote:
> >>> To go from an ID to an FS is not generally possible, because normally,
> >> the
> >>> framework doesn't keep this association.  There are exceptions though,
> >> the main
> >>> ones being:
> >>>
> >>> a) If you use low level CAS Apis to create FSs, the API returns the ID,
> >> which
> >>> means, that a GC that happens right after the API returns would garbage
> >> collect
> >>> the FS because at that point, nothing is "holding on" to any reference
> >> (it's not
> >>> in any index).  To prevent this, the low level create FS methods add
> the
> >> FS to a
> >>> map which goes from ID -> FS, and thus "holds onto" the FS, preventing
> >> Garbage
> >>> collection.
> >>>
> >>> b) Another case where this happens is when PEARs are used; in this case
> >> the FSs
> >>> involved with PEAR "trampoline" FSs end up being in similar maps.
> >>>
> >>> Both of these approaches of course disable a feature of V3 - namely,
> that
> >>> unrefererenced FSs can be garbage collected.
> >>>
> >>> ...
> >>>
> >>> There is an API in the V3 CASImpl, getFsFromId(int)  and also
> >>> getFsFromId_checked(int), which retrieves the associated FS, given the
> >> ID, or
> >>> returns null (or throws an exception) if it isn't in the table.  Most
> FSs
> >>> created normally, won't be in the table.
> >> Can we do this? -> As soon as an FS has been added to an index or is
> being
> >> referenced from another FS, its ID should be resolvable to the
> respective
> >> FS.
> >>
> >> When an FS is in an index or being referred by another FS, it cannot be
> >> garbage collected anyway. The CAS could maintain a lookup using weak
> >> references to provides a central place to look up such FSes via their
> IDs
> >> without preventing garbage collection.
> >>
> >> WebAnno remembers the ID of every FS rendered on screen. When the user
> >> makes an action, we load the CAS from disk and then look up the ID to
> >> retrieve the FS. We do not keep the CAS in memory all the time. If we
> would
> >> have to scan the whole CAS for the FS with a given ID, it would have
> >> probably a serious performance impact.
> >>
> >> Cheers,
> >>
> >> -- Richard
>
>

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Marshall Schor <ms...@schor.com>.

Hi Jörn,

Thanks for your input.  Could you possibly expand with a few specifics on what
changes you think would make it easier to use with Hadoop etc.?

-Marshall


On 9/7/2016 7:46 AM, Joern Kottmann wrote:
> Hello all,
>
> at my work place we use UIMA mostly with custom code to load data into a
> pipeline and store its results,
> therefore we don't depend at all on the UIMA serialization formats. And
> changing them, or adding new ones which
> are incompatible wouldn't be an issue at all. Also the existing code can be
> ported to work with UIMA 3.
>
> I really hope we can get UIMA 3 into a shape where it is easier to use with
> todays requirements (e.g. with Hadoop)
> and possibilities.
>
> I personally think that the effort to create the next overhauled version
> shouldn't be limited in anyway by backward compatibility.
> For me it is a good solution if there is some help with migrating things to
> UIMA 3 (e.g. a guide which explains what to do)
> and maybe maintaining UIMA 2 for a while in parallel (e.g. fixes of very
> urgent/critical bugs).
>
> Jörn
>
> On Fri, Sep 2, 2016 at 7:56 PM, Richard Eckart de Castilho <re...@apache.org>
> wrote:
>
>> See comment at end of mail.
>>
>> On 02.09.2016, at 15:18, Marshall Schor <ms...@schor.com> wrote:
>>> To go from an ID to an FS is not generally possible, because normally,
>> the
>>> framework doesn't keep this association.  There are exceptions though,
>> the main
>>> ones being:
>>>
>>> a) If you use low level CAS Apis to create FSs, the API returns the ID,
>> which
>>> means, that a GC that happens right after the API returns would garbage
>> collect
>>> the FS because at that point, nothing is "holding on" to any reference
>> (it's not
>>> in any index).  To prevent this, the low level create FS methods add the
>> FS to a
>>> map which goes from ID -> FS, and thus "holds onto" the FS, preventing
>> Garbage
>>> collection.
>>>
>>> b) Another case where this happens is when PEARs are used; in this case
>> the FSs
>>> involved with PEAR "trampoline" FSs end up being in similar maps.
>>>
>>> Both of these approaches of course disable a feature of V3 - namely, that
>>> unrefererenced FSs can be garbage collected.
>>>
>>> ...
>>>
>>> There is an API in the V3 CASImpl, getFsFromId(int)  and also
>>> getFsFromId_checked(int), which retrieves the associated FS, given the
>> ID, or
>>> returns null (or throws an exception) if it isn't in the table.  Most FSs
>>> created normally, won't be in the table.
>> Can we do this? -> As soon as an FS has been added to an index or is being
>> referenced from another FS, its ID should be resolvable to the respective
>> FS.
>>
>> When an FS is in an index or being referred by another FS, it cannot be
>> garbage collected anyway. The CAS could maintain a lookup using weak
>> references to provides a central place to look up such FSes via their IDs
>> without preventing garbage collection.
>>
>> WebAnno remembers the ID of every FS rendered on screen. When the user
>> makes an action, we load the CAS from disk and then look up the ID to
>> retrieve the FS. We do not keep the CAS in memory all the time. If we would
>> have to scan the whole CAS for the FS with a given ID, it would have
>> probably a serious performance impact.
>>
>> Cheers,
>>
>> -- Richard
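
The weak-reference lookup proposed in the quoted message above could look
roughly like the following. This is a sketch under stated assumptions, not
the V3 implementation: WeakIdLookup, register, get and getChecked are
hypothetical names, loosely modelled on the null-returning and
exception-throwing behaviour described for getFsFromId(int) and
getFsFromId_checked(int).

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

/** Sketch: id -> object table that does not keep its values alive. */
final class WeakIdLookup<T> {

  private final Map<Integer, WeakReference<T>> table = new HashMap<>();

  /** Remember the object under its id without preventing garbage collection. */
  void register(int id, T fs) {
    table.put(id, new WeakReference<>(fs));
  }

  /** Returns the object, or null if the id is unknown or the object was collected. */
  T get(int id) {
    WeakReference<T> ref = table.get(id);
    if (ref == null) {
      return null;
    }
    T fs = ref.get();
    if (fs == null) {
      table.remove(id); // drop the stale entry left behind by the collector
    }
    return fs;
  }

  /** Like get, but throws instead of returning null. */
  T getChecked(int id) {
    T fs = get(id);
    if (fs == null) {
      throw new IllegalArgumentException("No live object for id " + id);
    }
    return fs;
  }
}
```

The cost is the one mentioned elsewhere in the thread: one extra
WeakReference object per registered FS, plus the map entry itself.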

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Joern Kottmann <ko...@gmail.com>.

Hello all,

At my workplace we use UIMA mostly with custom code to load data into a
pipeline and store its results, so we don't depend on the UIMA serialization
formats at all. Changing them, or adding new ones which are incompatible,
wouldn't be an issue at all. Also, the existing code can be ported to work
with UIMA 3.

I really hope we can get UIMA 3 into a shape where it is easier to use with
today's requirements (e.g. with Hadoop) and possibilities.

I personally think that the effort to create the next overhauled version
shouldn't be limited in any way by backward compatibility. For me it is a
good solution if there is some help with migrating things to UIMA 3 (e.g. a
guide which explains what to do) and maybe maintaining UIMA 2 for a while in
parallel (e.g. fixes of very urgent/critical bugs).

Jörn

On Fri, Sep 2, 2016 at 7:56 PM, Richard Eckart de Castilho <re...@apache.org>
wrote:

> See comment at end of mail.
>
> On 02.09.2016, at 15:18, Marshall Schor <ms...@schor.com> wrote:
> >
> > To go from an ID to an FS is not generally possible, because normally,
> the
> > framework doesn't keep this association.  There are exceptions though,
> the main
> > ones being:
> >
> > a) If you use low level CAS Apis to create FSs, the API returns the ID,
> which
> > means, that a GC that happens right after the API returns would garbage
> collect
> > the FS because at that point, nothing is "holding on" to any reference
> (it's not
> > in any index).  To prevent this, the low level create FS methods add the
> FS to a
> > map which goes from ID -> FS, and thus "holds onto" the FS, preventing
> Garbage
> > collection.
> >
> > b) Another case where this happens is when PEARs are used; in this case
> the FSs
> > involved with PEAR "trampoline" FSs end up being in similar maps.
> >
> > Both of these approaches of course disable a feature of V3 - namely, that
> > unreferenced FSs can be garbage collected.
> >
> > ...
> >
>
> > There is an API in the V3 CASImpl, getFsFromId(int)  and also
> > getFsFromId_checked(int), which retrieves the associated FS, given the
> ID, or
> > returns null (or throws an exception) if it isn't in the table.  Most FSs
> > created normally, won't be in the table.
>
> Can we do this? -> As soon as an FS has been added to an index or is being
> referenced from another FS, its ID should be resolvable to the respective
> FS.
>
> When an FS is in an index or is referenced by another FS, it cannot be
> garbage collected anyway. The CAS could maintain a lookup using weak
> references to provide a central place to look up such FSes via their IDs
> without preventing garbage collection.
>
> WebAnno remembers the ID of every FS rendered on screen. When the user
> performs an action, we load the CAS from disk and then look up the ID to
> retrieve the FS. We do not keep the CAS in memory all the time. If we had
> to scan the whole CAS for the FS with a given ID, it would probably have
> a serious performance impact.
>
> Cheers,
>
> -- Richard

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Jens Grivolla <j+...@grivolla.net>.

Hi, at some point we would have wished for (stable) FS IDs to be able to
reference annotations, especially when trying to work outside of UIMA (and
possibly Java). For that we would actually have liked to have something
more geared towards users than the IDs that appear e.g. in XMIs, with clear
documentation of what those IDs represent and how to deal with them e.g.
when generating or modifying XMI outside of UIMA.

On the other hand, backwards compatibility with the V2 addresses is not a
concern for us at all.

Best,
Jens


Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Peter Klügl <pe...@averbis.com>.

Hi,


IDs are just too convenient when you build an editor, or in other use
cases where you want to modify an annotation but cannot keep the CAS in
memory. I assume that changing the IDs when storing the CAS could be OK.


We should at least try to support them and see how bad the performance
drop is.


Btw, I already have another use case where I use them: applying Ruta
rules directly to annotation objects in Java code. Here, the address/ID
is injected into the rule string and later resolved again within the
Ruta implementation.

Ruta.matches(jcas, Ruta.inject("${PARTOF(Person)} NUM;", annotation))

This returns true if the given annotation is covered by a Person
annotation and is followed by a NUM annotation.

I would like to extend this functionality in Ruta in the future, and I do
not see how I can keep it without IDs.


Best,


Peter



Am 08.09.2016 um 15:27 schrieb Marshall Schor:
> It seems that some (but not all) users really like and make use of
>
> * int "id"s that are stable and don't change due to loading/saving
> * to get "direct access" to FSs using these "id"s
> * want UIMA framework support for this
>
> I state this based on a history over time of multiple discussions on various
> lists, about this topic.
>
> Up to now, these users have been using internal data in V2 (the "address" in the
> low level representation), which is stable for some load/save operations but not
> others.
>
> Supporting this costs two things:
> * space - in each FS, for the int "id" and
> * space/time to hold and update a map from "id" to the FS for direct access. 
> This map would likely have "weak references" (an additional Java Object overhead
> per FS) to permit GC to work. (The use of weak refs could be an option, as well).
>
> We could support such a thing in V3 based on some pipeline setting (e.g. using
> additionalParameters options); this would permit freeing the use of internal
> id's etc., to be more just for internal use.
>
> Is this a reasonable description of this "use case"? Does it seem reasonable for
> V3 to support such a thing?
>
> -Marshall
>
>
> On 9/2/2016 1:56 PM, Richard Eckart de Castilho wrote:
>> See comment at end of mail.
>>
>> On 02.09.2016, at 15:18, Marshall Schor <ms...@schor.com> wrote:
>>> To go from an ID to an FS is not generally possible, because normally, the
>>> framework doesn't keep this association.  There are exceptions though, the main
>>> ones being:
>>>
>>> a) If you use low level CAS Apis to create FSs, the API returns the ID, which
>>> means, that a GC that happens right after the API returns would garbage collect
>>> the FS because at that point, nothing is "holding on" to any reference (it's not
>>> in any index).  To prevent this, the low level create FS methods add the FS to a
>>> map which goes from ID -> FS, and thus "holds onto" the FS, preventing Garbage
>>> collection.
>>>
>>> b) Another case where this happens is when PEARs are used; in this case the FSs
>>> involved with PEAR "trampoline" FSs end up being in similar maps.
>>>
>>> Both of these approaches of course disable a feature of V3 - namely, that
>>> unreferenced FSs can be garbage collected.
>>>
>>> ...
>>>
>>> There is an API in the V3 CASImpl, getFsFromId(int)  and also
>>> getFsFromId_checked(int), which retrieves the associated FS, given the ID, or
>>> returns null (or throws an exception) if it isn't in the table.  Most FSs
>>> created normally, won't be in the table.
>> Can we do this? -> As soon as an FS has been added to an index or is being referenced from another FS, its ID should be resolvable to the respective FS.
>>
>> When an FS is in an index or being referred by another FS, it cannot be garbage collected anyway. The CAS could maintain a lookup using weak references to provides a central place to look up such FSes via their IDs without preventing garbage collection.
>>
>> WebAnno remembers the ID of every FS rendered on screen. When the user makes an action, we load the CAS from disk and then look up the ID to retrieve the FS. We do not keep the CAS in memory all the time. If we would have to scan the whole CAS for the FS with a given ID, it would have probably a serious performance impact.
>>
>> Cheers,
>>
>> -- Richard

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Marshall Schor <ms...@schor.com>.

It seems that some (but not all) users

* rely on int "id"s that are stable and don't change due to loading/saving
* use these "id"s to get "direct access" to FSs
* want UIMA framework support for this

I state this based on multiple discussions of this topic on various lists
over time.

Up to now, these users have been using internal data in V2 (the "address" in the
low level representation), which is stable for some load/save operations but not
others.

Supporting this costs two things:
* space - in each FS, for the int "id", and
* space/time to hold and update a map from "id" to the FS for direct access.
  This map would likely use "weak references" (an additional Java Object of
  overhead per FS) to permit GC to work. (The use of weak refs could itself be
  an option.)

We could support such a thing in V3 based on some pipeline setting (e.g. using
additionalParameters options); this would free the internal ids etc. to be used
purely internally.

Is this a reasonable description of this "use case"? Does it seem reasonable for
V3 to support such a thing?

-Marshall


On 9/2/2016 1:56 PM, Richard Eckart de Castilho wrote:
> See comment at end of mail.
>
> On 02.09.2016, at 15:18, Marshall Schor <ms...@schor.com> wrote:
>> To go from an ID to an FS is not generally possible, because normally, the
>> framework doesn't keep this association.  There are exceptions though, the main
>> ones being:
>>
>> a) If you use low level CAS Apis to create FSs, the API returns the ID, which
>> means, that a GC that happens right after the API returns would garbage collect
>> the FS because at that point, nothing is "holding on" to any reference (it's not
>> in any index).  To prevent this, the low level create FS methods add the FS to a
>> map which goes from ID -> FS, and thus "holds onto" the FS, preventing Garbage
>> collection.
>>
>> b) Another case where this happens is when PEARs are used; in this case the FSs
>> involved with PEAR "trampoline" FSs end up being in similar maps.
>>
>> Both of these approaches of course disable a feature of V3 - namely, that
>> unreferenced FSs can be garbage collected.
>>
>> ...
>>
>> There is an API in the V3 CASImpl, getFsFromId(int)  and also
>> getFsFromId_checked(int), which retrieves the associated FS, given the ID, or
>> returns null (or throws an exception) if it isn't in the table.  Most FSs
>> created normally, won't be in the table.
> Can we do this? -> As soon as an FS has been added to an index or is being referenced from another FS, its ID should be resolvable to the respective FS.
>
> When an FS is in an index or being referred by another FS, it cannot be garbage collected anyway. The CAS could maintain a lookup using weak references to provides a central place to look up such FSes via their IDs without preventing garbage collection.
>
> WebAnno remembers the ID of every FS rendered on screen. When the user makes an action, we load the CAS from disk and then look up the ID to retrieve the FS. We do not keep the CAS in memory all the time. If we would have to scan the whole CAS for the FS with a given ID, it would have probably a serious performance impact.
>
> Cheers,
>
> -- Richard

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Richard Eckart de Castilho <re...@apache.org>.

See comment at end of mail.

On 02.09.2016, at 15:18, Marshall Schor <ms...@schor.com> wrote:
> 
> To go from an ID to an FS is not generally possible, because normally, the
> framework doesn't keep this association.  There are exceptions though, the main
> ones being:
> 
> a) If you use low level CAS Apis to create FSs, the API returns the ID, which
> means, that a GC that happens right after the API returns would garbage collect
> the FS because at that point, nothing is "holding on" to any reference (it's not
> in any index).  To prevent this, the low level create FS methods add the FS to a
> map which goes from ID -> FS, and thus "holds onto" the FS, preventing Garbage
> collection.
> 
> b) Another case where this happens is when PEARs are used; in this case the FSs
> involved with PEAR "trampoline" FSs end up being in similar maps.
> 
> Both of these approaches of course disable a feature of V3 - namely, that
> unreferenced FSs can be garbage collected.
> 
> ...
> 

> There is an API in the V3 CASImpl, getFsFromId(int)  and also
> getFsFromId_checked(int), which retrieves the associated FS, given the ID, or
> returns null (or throws an exception) if it isn't in the table.  Most FSs
> created normally, won't be in the table.

Can we do this? -> As soon as an FS has been added to an index or is being referenced from another FS, its ID should be resolvable to the respective FS.

When an FS is in an index or is referenced by another FS, it cannot be garbage collected anyway. The CAS could maintain a lookup using weak references to provide a central place to look up such FSes via their IDs without preventing garbage collection.

WebAnno remembers the ID of every FS rendered on screen. When the user performs an action, we load the CAS from disk and then look up the ID to retrieve the FS. We do not keep the CAS in memory all the time. If we had to scan the whole CAS for the FS with a given ID, it would probably have a serious performance impact.
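
A minimal sketch of such a weak-reference lookup, assuming nothing about the actual UIMA internals. The class WeakIdLookup and its register/lookup methods are hypothetical names, and a plain Object stands in for an FS:

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical ID -> FS lookup that does not keep FSs alive; not UIMA API. */
class WeakIdLookup<T> {

    /** Weak reference that remembers its ID so a cleared entry can be removed. */
    private static final class IdRef<F> extends WeakReference<F> {
        final int id;

        IdRef(int id, F referent, ReferenceQueue<? super F> queue) {
            super(referent, queue);
            this.id = id;
        }
    }

    private final Map<Integer, IdRef<T>> byId = new HashMap<>();
    private final ReferenceQueue<T> cleared = new ReferenceQueue<>();

    /** Register an FS, e.g. when it is added to an index or referenced by another FS. */
    synchronized void register(int id, T fs) {
        purge();
        byId.put(id, new IdRef<>(id, fs, cleared));
    }

    /** Returns the FS for the ID, or null if it was never registered or was collected. */
    synchronized T lookup(int id) {
        purge();
        IdRef<T> ref = byId.get(id);
        return ref == null ? null : ref.get();
    }

    /** Drop map entries whose FS has already been garbage collected. */
    private void purge() {
        Reference<? extends T> gone;
        while ((gone = cleared.poll()) != null) {
            IdRef<?> ref = (IdRef<?>) gone;
            // Remove only if the ID still maps to this exact (cleared) reference.
            byId.remove(ref.id, ref);
        }
    }

    public static void main(String[] args) {
        WeakIdLookup<Object> lookup = new WeakIdLookup<>();
        Object fs = new Object();   // strong reference held by the caller
        lookup.register(17, fs);
        System.out.println(lookup.lookup(17) == fs); // true: fs is still strongly reachable
        System.out.println(lookup.lookup(99));       // null: never registered
    }
}
```

Because the map holds only weak references, an FS that is no longer in any index or referenced elsewhere can still be collected, and its entry is dropped the next time the lookup is touched. The price is one extra reference object per registered FS, plus the map entry.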

Cheers,

-- Richard

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Marshall Schor <ms...@schor.com>.

Hi,

In v3, FSs are represented completely by Java class instances.  The indexes
hold FSs directly as their values. FSs which reference other FSs have
direct references to them, and don't use IDs.

The IDs are used for backwards compatibility - to support the CAS Low Level
APIs. This is an official UIMA public API (LowLevelCAS interface), and projects
that were trying to get higher performance sometimes made use of it.  With this
API, you could implement annotators that didn't create any Java objects, for
example.  This used to be important, 15 years ago, when Java was "new".

In V3, using the low level APIs is supported, but would actually be less
efficient than using the normal Java APIs. The low level APIs refer to FSs using
their IDs, which are "ints".  To make that work, those FSs which are created
with low level APIs are put into a map which maps the ID to the FS.

Re: Testing v3 -
_____________

Remember, the v3 branch is not currently in a good state, because I am in the
middle of merging in the recent flurry of changes made to get v2.9.0 out.
Right now, for instance, one of the new test cases has uncovered a
missing part of the binary deserialization (delta) implementation in v3, and I'm
working on figuring out how to fix this.

There is an API in the V3 CASImpl, getFsFromId(int)  and also
getFsFromId_checked(int), which retrieves the associated FS, given the ID, or
returns null (or throws an exception) if it isn't in the table.  Most FSs
created normally won't be in the table.
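
The behaviour described above can be modelled with a toy sketch. It is not the UIMA implementation: ToyCas, ToyFs, llCreateFs, fsFromId and fsFromIdChecked are hypothetical names, and the use of NoSuchElementException for the checked variant is an assumption made only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Toy model only; class and method names are hypothetical, not UIMA's. */
class ToyCas {

    /** Stand-in for a feature structure: it just carries its ID. */
    static final class ToyFs {
        final int id;

        ToyFs(int id) {
            this.id = id;
        }
    }

    private int nextId = 1;

    // Strong references: anything stored here cannot be garbage collected.
    private final Map<Integer, ToyFs> createdViaLowLevel = new HashMap<>();

    /** Normal creation: the caller holds the object, no ID -> FS entry is kept. */
    ToyFs createFs() {
        return new ToyFs(nextId++);
    }

    /** Low-level style creation: only an int is returned, so the FS is retained here. */
    int llCreateFs() {
        ToyFs fs = new ToyFs(nextId++);
        createdViaLowLevel.put(fs.id, fs);
        return fs.id;
    }

    /** Unchecked lookup: null when the ID is not in the table. */
    ToyFs fsFromId(int id) {
        return createdViaLowLevel.get(id);
    }

    /** Checked lookup: throws when the ID is not in the table. */
    ToyFs fsFromIdChecked(int id) {
        ToyFs fs = createdViaLowLevel.get(id);
        if (fs == null) {
            throw new NoSuchElementException("no FS registered for id " + id);
        }
        return fs;
    }

    public static void main(String[] args) {
        ToyCas cas = new ToyCas();
        ToyFs normal = cas.createFs();   // gets id 1, not in the table
        int llId = cas.llCreateFs();     // gets id 2, retained in the table
        System.out.println(cas.fsFromId(normal.id)); // null
        System.out.println(cas.fsFromId(llId).id);   // 2
    }
}
```

The sketch also shows the trade-off: the table is what makes ID -> FS work for low-level callers, and it is the same table that keeps those FSs from being collected.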

The recommended way to deal with this is to use (in Java) actual references to
the FSs, in place of the IDs, which is what the v3 framework does.

Hope that answers your question; if not, ask more :-)

-Marshall

On 9/2/2016 9:31 AM, Peter Klügl wrote:
> What does this mean?
>
> ID -> FS is not possible in v3, or only with low level API?
>
> Testing v3 and taking a closer look is still on my todo list, but I
> found not the time yet.
>
> Best,
>
> Peter
>
>
> Am 02.09.2016 um 15:18 schrieb Marshall Schor:
>> In v3, there are fast lookups FS -> ID :
>>
>>    myFs._id()  // compiles to a fetch of a final int field in the FS object
>>
>> To go from an ID to an FS is not generally possible, because normally, the
>> framework doesn't keep this association.  There are exceptions though, the main
>> ones being:
>>
>> a) If you use low level CAS Apis to create FSs, the API returns the ID, which
>> means, that a GC that happens right after the API returns would garbage collect
>> the FS because at that point, nothing is "holding on" to any reference (it's not
>> in any index).  To prevent this, the low level create FS methods add the FS to a
>> map which goes from ID -> FS, and thus "holds onto" the FS, preventing Garbage
>> collection.
>>
>> b) Another case where this happens is when PEARs are used; in this case the FSs
>> involved with PEAR "trampoline" FSs end up being in similar maps.
>>
>> Both of these approaches of course disable a feature of V3 - namely, that
>> unreferenced FSs can be garbage collected.
>>
>> -Marshall
>>
>>
>> On 9/2/2016 8:47 AM, Richard Eckart de Castilho wrote:
>>> Fast lookups ID -> FS and FS -> ID would also be very much appreciated :)
>>>
>>> Cheers,
>>>
>>> -- Richard
>>>
>>>> On 02.09.2016, at 14:17, Burn Lewis <bu...@gmail.com> wrote:
>>>>
>>>> Could the id assigned in V3 be the same as the V2 address, as if the offset
>>>> in a heap?  Unique and monotonically increasing.
>>>>
>>>> Burn
>>>>
>>>> On Fri, Sep 2, 2016 at 5:36 AM, Peter Klügl <pe...@averbis.com>
>>>> wrote:
>>>>
>>>>> Same here.
>>>>>
>>>>>
>>>>> It looks like that we are now also starting to use the address, and I am
>>>>> also thinking of using it more in Ruta (internal indexing).
>>>>>
>>>>>
>>>>> Btw, I did some simple experiments lately concerning the stability of
>>>>> the addresses when using CasIOUtils. Can it happens that the addresses
>>>>> change if you just deserialize the same CAs twice without serializing it
>>>>> in between?
>

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Peter Klügl <pe...@averbis.com>.

What does this mean?

Is ID -> FS not possible in v3, or only with the low level API?

Testing v3 and taking a closer look is still on my todo list, but I
have not found the time yet.

Best,

Peter


Am 02.09.2016 um 15:18 schrieb Marshall Schor:
> In v3, there are fast lookups FS -> ID :
>
>    myFs._id()  // compiles to a fetch of a final int field in the FS object
>
> To go from an ID to an FS is not generally possible, because normally, the
> framework doesn't keep this association.  There are exceptions though, the main
> ones being:
>
> a) If you use low level CAS Apis to create FSs, the API returns the ID, which
> means, that a GC that happens right after the API returns would garbage collect
> the FS because at that point, nothing is "holding on" to any reference (it's not
> in any index).  To prevent this, the low level create FS methods add the FS to a
> map which goes from ID -> FS, and thus "holds onto" the FS, preventing Garbage
> collection.
>
> b) Another case where this happens is when PEARs are used; in this case the FSs
> involved with PEAR "trampoline" FSs end up being in similar maps.
>
> Both of these approaches of course disable a feature of V3 - namely, that
> unreferenced FSs can be garbage collected.
>
> -Marshall
>
>
> On 9/2/2016 8:47 AM, Richard Eckart de Castilho wrote:
>> Fast lookups ID -> FS and FS -> ID would also be very much appreciated :)
>>
>> Cheers,
>>
>> -- Richard
>>
>>> On 02.09.2016, at 14:17, Burn Lewis <bu...@gmail.com> wrote:
>>>
>>> Could the id assigned in V3 be the same as the V2 address, as if the offset
>>> in a heap?  Unique and monotonically increasing.
>>>
>>> Burn
>>>
>>> On Fri, Sep 2, 2016 at 5:36 AM, Peter Klügl <pe...@averbis.com>
>>> wrote:
>>>
>>>> Same here.
>>>>
>>>>
>>>> It looks like that we are now also starting to use the address, and I am
>>>> also thinking of using it more in Ruta (internal indexing).
>>>>
>>>>
>>>> Btw, I did some simple experiments lately concerning the stability of
>>>> the addresses when using CasIOUtils. Can it happens that the addresses
>>>> change if you just deserialize the same CAs twice without serializing it
>>>> in between?

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Marshall Schor <ms...@schor.com>.

In v3, there are fast lookups FS -> ID :

   myFs._id()  // compiles to a fetch of a final int field in the FS object

To go from an ID to an FS is not generally possible, because normally, the
framework doesn't keep this association.  There are exceptions though, the main
ones being:

a) If you use low level CAS APIs to create FSs, the API returns the ID, which
means that a GC that happens right after the API returns would garbage collect
the FS, because at that point nothing is "holding on" to any reference (it's not
in any index).  To prevent this, the low level create FS methods add the FS to a
map which goes from ID -> FS, and thus "holds onto" the FS, preventing garbage
collection.

b) Another case where this happens is when PEARs are used; in this case the FSs
involved with PEAR "trampoline" FSs end up being in similar maps.

Both of these approaches of course disable a feature of V3 - namely, that
unreferenced FSs can be garbage collected.

-Marshall

On 9/2/2016 8:47 AM, Richard Eckart de Castilho wrote:
> Fast lookups ID -> FS and FS -> ID would also be very much appreciated :)
>
> Cheers,
>
> -- Richard
>
>> On 02.09.2016, at 14:17, Burn Lewis <bu...@gmail.com> wrote:
>>
>> Could the id assigned in V3 be the same as the V2 address, as if the offset
>> in a heap?  Unique and monotonically increasing.
>>
>> Burn
>>
>> On Fri, Sep 2, 2016 at 5:36 AM, Peter Klügl <pe...@averbis.com>
>> wrote:
>>
>>> Same here.
>>>
>>>
>>> It looks like that we are now also starting to use the address, and I am
>>> also thinking of using it more in Ruta (internal indexing).
>>>
>>>
>>> Btw, I did some simple experiments lately concerning the stability of
>>> the addresses when using CasIOUtils. Can it happens that the addresses
>>> change if you just deserialize the same CAs twice without serializing it
>>> in between?
>

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Richard Eckart de Castilho <re...@apache.org>.

Fast lookups ID -> FS and FS -> ID would also be very much appreciated :)

Cheers,

-- Richard

> On 02.09.2016, at 14:17, Burn Lewis <bu...@gmail.com> wrote:
> 
> Could the id assigned in V3 be the same as the V2 address, as if the offset
> in a heap?  Unique and monotonically increasing.
> 
> Burn
> 
> On Fri, Sep 2, 2016 at 5:36 AM, Peter Klügl <pe...@averbis.com>
> wrote:
> 
>> Same here.
>> 
>> 
>> It looks like that we are now also starting to use the address, and I am
>> also thinking of using it more in Ruta (internal indexing).
>> 
>> 
>> Btw, I did some simple experiments lately concerning the stability of
>> the addresses when using CasIOUtils. Can it happens that the addresses
>> change if you just deserialize the same CAs twice without serializing it
>> in between?

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Burn Lewis <bu...@gmail.com>.

Could the id assigned in V3 be the same as the V2 address, as if the offset
in a heap?  Unique and monotonically increasing.
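
A minimal sketch of that idea, assuming a v2-style layout in which each FS occupies one heap slot for its type code plus one slot per feature, and in which address 0 is reserved to mean "no FS". The class HeapStyleIdAllocator and its allocate method are hypothetical and only illustrate the arithmetic:

```java
/** Hypothetical allocator that hands out IDs spaced like offsets in a v2-style heap. */
class HeapStyleIdAllocator {

    // Assumption: 0 is reserved for "no FS", so the first FS gets address 1.
    private int next = 1;

    /**
     * Returns the ID for a new FS and advances the counter by the number of
     * slots the FS would have used: one for the type code plus one per feature.
     */
    int allocate(int featureCount) {
        if (featureCount < 0) {
            throw new IllegalArgumentException("featureCount must be >= 0");
        }
        int id = next;
        next = Math.addExact(next, 1 + featureCount); // fails loudly on int overflow
        return id;
    }

    public static void main(String[] args) {
        HeapStyleIdAllocator ids = new HeapStyleIdAllocator();
        System.out.println(ids.allocate(2)); // 1
        System.out.println(ids.allocate(0)); // 4
        System.out.println(ids.allocate(3)); // 5
    }
}
```

IDs produced this way stay unique and monotonically increasing, but they are no longer consecutive, and they only match real v2 addresses if FSs are created in the same order with the same slot sizes.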

Burn


Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Peter Klügl <pe...@averbis.com>.

Same here.


It looks like we are now also starting to use the address, and I am
also thinking of using it more in Ruta (internal indexing).


Btw, I did some simple experiments lately concerning the stability of
the addresses when using CasIOUtils. Can it happen that the addresses
change if you just deserialize the same CAS twice without serializing it
in between?


Best,


Peter



Am 01.09.2016 um 19:29 schrieb Richard Eckart de Castilho:
> FS IDs are IMHO a very useful thing. Providing out-of-band (i.e. out-of-type-system) unique identifiers for feature structures facilitates handling them in e.g. in editors. We use that quite a bit in WebAnno.
>
> In WebAnno, we do not rely on any heap arithmetics - an ID is just expected to be a unique identifier. However, I could imagine cases where people might rely on the ID to increment monotonically for new FSes.
>
> Most binary formats do not preserve the ID across a save/load cycle. However, SERIALIZED and SERIALIZED_TSI *do* preserve the ID, and WebAnno makes used of that. It allows to keep references to FSes without having to keep the CAS in memory all the time. 
>
> There should continue to be a V3 serialization format which preserves IDs across a load/save cycle. 
>
> I do presently not see a case where a strong similarity between V2 and V3 IDs would be important. It would be nice if deserializing a V2 SERIALIZED or SERIALIZED_TSI into V3 would restore the V2 IDs - I expect it to be an easy thing to do.
>
> Cheers,
>
> -- Richard
>
>> On 01.09.2016, at 16:09, Marshall Schor <ms...@schor.com> wrote:
>>
>> UIMA V3 implementation includes in many places extra code (takes time / space)
>> whose goal is to make things look closer to version 2.  Some of this is for
>> interoperability with version 2 artifacts, like serialized forms.
>>
>> An example: in v2, many serialization forms include "references" to other
>> Feature Structures (FSs), and for those, the encoding is the "address" in the
>> heap of the FS.
>>
>> In v3, there is no heap, but the FSs have "ids", which are (at the moment) an
>> int which increments by 1.  This mis-matches the "address" in v2, so many parts
>> of the serialization code builds a map at serialization time from the v3 id's to
>> v2 "addresses", and uses the latter in the serialization form.
>>
>> Currently, this is done for various binary serializations, so that these can be
>> read back in by v2 code.
>>
>> Currently, it's not done for JSON or XMI (and maybe XCAS - haven't checked).  So
>> the serialized forms for these differ between v2 and v3, in that the numbers
>> used to represent references to other FSs are different.
>>
>> The deserialization code for XMI and JSON doesn't depend on these numbers being
>> anything other than unique per FS, so there's no issue in deserializing.  But
>> the UIMA community may have built other things that depend on these identifiers
>> not changing. 
>>
>> What's your opinion: should the XMI and JSON etc serialization in V3 be changed
>> to reproduce (approximately) the same reference numbers as v2?  I say
>> approximately, because other factors might affect these, such as the ordering
>> for things not in "ordered" indexes, etc. between v2 and v3.
>>
>> -Marshall
>>

Re: opinion on degree of backwards compatibility for Uima V3 experiment

Posted by Richard Eckart de Castilho <re...@apache.org>.

FS IDs are IMHO a very useful thing. Providing out-of-band (i.e. out-of-type-system) unique identifiers for feature structures facilitates handling them, e.g. in editors. We use that quite a bit in WebAnno.

In WebAnno, we do not rely on any heap arithmetic - an ID is just expected to be a unique identifier. However, I could imagine cases where people might rely on the ID to increment monotonically for new FSes.

Most binary formats do not preserve the ID across a save/load cycle. However, SERIALIZED and SERIALIZED_TSI *do* preserve the ID, and WebAnno makes use of that. It allows keeping references to FSes without having to keep the CAS in memory all the time.

There should continue to be a V3 serialization format which preserves IDs across a load/save cycle.

I presently do not see a case where a strong similarity between V2 and V3 IDs would be important. It would be nice if deserializing a V2 SERIALIZED or SERIALIZED_TSI into V3 would restore the V2 IDs - I expect it to be an easy thing to do.

Cheers,

-- Richard

> On 01.09.2016, at 16:09, Marshall Schor <ms...@schor.com> wrote:
> 
> UIMA V3 implementation includes in many places extra code (takes time / space)
> whose goal is to make things look closer to version 2.  Some of this is for
> interoperability with version 2 artifacts, like serialized forms.
> 
> An example: in v2, many serialization forms include "references" to other
> Feature Structures (FSs), and for those, the encoding is the "address" in the
> heap of the FS.
> 
> In v3, there is no heap, but the FSs have "ids", which are (at the moment) an
> int which increments by 1.  This mis-matches the "address" in v2, so many parts
> of the serialization code builds a map at serialization time from the v3 id's to
> v2 "addresses", and uses the latter in the serialization form.
> 
> Currently, this is done for various binary serializations, so that these can be
> read back in by v2 code.
> 
> Currently, it's not done for JSON or XMI (and maybe XCAS - haven't checked).  So
> the serialized forms for these differ between v2 and v3, in that the numbers
> used to represent references to other FSs are different.
> 
> The deserialization code for XMI and JSON doesn't depend on these numbers being
> anything other than unique per FS, so there's no issue in deserializing.  But
> the UIMA community may have built other things that depend on these identifiers
> not changing. 
> 
> What's your opinion: should the XMI and JSON etc serialization in V3 be changed
> to reproduce (approximately) the same reference numbers as v2?  I say
> approximately, because other factors might affect these, such as the ordering
> for things not in "ordered" indexes, etc. between v2 and v3.
> 
> -Marshall
>