Posted to dev@uima.apache.org by Thilo Goetz <tw...@gmx.de> on 2007/10/17 08:26:49 UTC

New CAS heap impl?

I'm thinking about experimenting with alternative heap
implementations in the CAS.  In particular, I would like
to try out a heap impl that uses regular Java objects to
represent feature structures, as opposed to our proprietary
binary heap.

Our current heap design was created when object creation
in Java was very expensive.  I ran experiments at the time
that showed that creating FSs the way we do today was about
twice as fast as creating Java objects.  However, there
are many reasons to run this experiment again today:

 * Object creation in Java is a lot faster today.  The speed
   advantage may be very much reduced, or even gone
   completely.

 * FS creation is not where a typical annotator spends its
   time.  Only for annotators that create a lot of annotations
   with little computation effort (such as tokenizers) is this
   at all significant.

 * Our current heap implementation pre-allocates a lot of
   memory.  This works relatively well for medium size CASes,
   but it has disadvantages both for very small and very
   large CASes.  When using Java objects to represent FSs,
   we leave the memory allocation to the JVM, which seems
   like the right thing to do.

 * We have no garbage collection on the heap.  FSs that are
   once created stay there for the lifetime of the heap.
   This is not a problem for most annotators, but there are
   situations where this behavior is highly undesirable.
   Using Java objects instead, we would benefit from the
   garbage collector of the JVM.
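
To make the contrast concrete, here is a minimal sketch of the two
representations (hypothetical names and a simplified slot layout; this
is not the actual CAS code):

    // A minimal sketch (all names hypothetical) contrasting the two ideas.
    public class FsRepresentationSketch {
        // Today (simplified): an FS is an offset into a shared int heap;
        // one slot holds the type code, the following slots hold features.
        static int readBegin(int[] heap, int fsAddr) {
            return heap[fsAddr + 2];          // fixed slot layout per type
        }

        // Alternative: an FS is a plain Java object; the JVM allocates it
        // and the garbage collector reclaims it once nothing refers to it.
        static class AnnotationRecord {
            int typeCode;
            int begin;
            int end;
        }

        public static void main(String[] args) {
            int[] heap = new int[16];
            heap[1 + 2] = 7;                  // FS at address 1, begin = 7
            System.out.println(readBegin(heap, 1));

            AnnotationRecord ann = new AnnotationRecord();
            ann.begin = 7;
            System.out.println(ann.begin);    // plain field access
        }
    }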

So here's the rub.  Before I even start with this, I would
like to refactor the CAS implementation so I can see what
I'm doing.  The CASImpl class has grown organically for many
years now, and it's due for a major overhaul.  I will not
change any APIs, of course, but I'll probably leave no stone
unturned in the implementation.  Any objections to that?

Secondly, I will need help with the CAS serialization.  The
current binary serialization depends completely on the
heap layout.  Eddie, would you have time to work with me
on that?  I would like to make the serialization independent
of the heap implementation and only rely on the low-level
CAS APIs.  That might be a tiny bit slower (which is still
to be determined), but it will give us better encapsulation
and more flexibility with various heap implementations.
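
To illustrate the direction (not a proposal for the actual format), a
serializer that relies only on public accessors would look roughly like
the sketch below, written against the present-day public API; it covers
just the annotation index and writes a made-up record per FS:

    import java.io.DataOutputStream;
    import java.io.IOException;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.FSIterator;
    import org.apache.uima.cas.text.AnnotationFS;

    // Sketch only: walks the CAS through its public API instead of dumping
    // the heap arrays.  The output format here is invented for illustration.
    public class ApiBasedSerializerSketch {
        public static void write(CAS cas, DataOutputStream out) throws IOException {
            FSIterator<AnnotationFS> it = cas.getAnnotationIndex().iterator();
            while (it.hasNext()) {
                AnnotationFS fs = it.next();
                out.writeUTF(fs.getType().getName()); // type by name, not type code
                out.writeInt(fs.getBegin());          // feature values via getters,
                out.writeInt(fs.getEnd());            // not via heap offsets
            }
        }
    }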

Let me know what you think.

--Thilo

Re: New CAS heap impl?

Posted by Marshall Schor <ms...@schor.com>.
I have no basic objections to refactoring the CAS Impl - just some
concerns, the obvious ones being space and time impacts, and reliability.

I was reminded of the importance of these just today when someone at
lunch mentioned they were doing runs to collect statistical info for
further processing, and one run was taking 11 hours. 

I think one of the incentives for UIMA adoption has been the priority
it has given to being both space and time efficient.  The best outcome
of your refactoring would be something
that, in addition to making things conceptually clearer, sped it up and
had a smaller footprint ;-)  

Cheers. -Marshall

Thilo Goetz wrote:
> I'm thinking about experimenting with alternative heap
> implementations in the CAS.  In particular, I would like
> to try out a heap impl that uses regular Java objects to
> represent feature structures, as opposed to our proprietary
> binary heap.
>
> Our current heap design was created when object creation
> in Java was very expensive.  I ran experiments at the time
> that showed that creating FSs the way we do today was about
> twice as fast as creating Java objects.  However, there
> are many reasons to run this experiment again today:
>
>  * Object creation in Java is a lot faster today.  The speed
>    advantage may be very much reduced, or even gone
>    completely.
>
>  * FS creation is not where a typical annotator spends its
>    time.  Only for annotators that create a lot of annotations
>    with little computation effort (such as tokenizers) is this
>    at all significant.
>
>  * Our current heap implementation pre-allocates a lot of
>    memory.  This works relatively well for medium size CASes,
>    but it has disadvantages both for very small and very
>    large CASes.  When using Java objects to represent FSs,
>    we leave the memory allocation to the JVM, which seems
>    like the right thing to do.
>
>  * We have no garbage collection on the heap.  FSs that are
>    once created stay there for the lifetime of the heap.
>    This is not a problem for most annotators, but there are
>    situations where this behavior is highly undesirable.
>    Using Java objects instead, we would benefit from the
>    garbage collector of the JVM.
>
> So here's the rub.  Before I even start with this, I would
> like to refactor the CAS implementation so I can see what
> I'm doing.  The CASImpl class has grown organically for many
> years now, and it's due for a major overhaul.  I will not
> change any APIs, of course, but I'll probably leave no stone
> unturned in the implementation.  Any objections to that?
>
> Secondly, I will need help with the CAS serialization.  The
> current binary serialization depends completely on the
> heap layout.  Eddie, would you have time to work with me
> on that?  I would like to make the serialization independent
> of the heap implementation and only rely on the low-level
> CAS APIs.  That might be a tiny bit slower (which is still
> to be determined), but it will give us better encapsulation
> and more flexibility with various heap implementations.
>
> Let me know what you think.
>
> --Thilo
>
>
>   


Re: New CAS heap impl?

Posted by Eddie Epstein <ea...@gmail.com>.
On 10/30/07, Thilo Goetz <tw...@gmx.de> wrote:
>
> Eddie Epstein wrote:
> [...]
> > With regard to indexing information in the CAS XMI format, what is passed
> > is a list of FS that are indexed [in each view]. Today, without delta CAS,
> > the indexes are fully rebuilt when the CAS is returned. All sorted indexes
> > will retain the same iteration order. Non-sorted indexes may have a
> > different order, but that has always been documented: "A bag index simply
> > stores everything, without any guaranteed order."
> >
> > The only potential change in behavior that I am aware of has to do with
> > adding an FS to the index repository multiple times: "... all FSs that are
> > committed are entered, even if they are duplicates of already existing
> > FSs." So yes, that would be a change in behavior, as there would only be a
> > single instance of each FS in the index upon return from a remote
> > component. Is this the difference you were referring to, or is there more?
> >
> > Eddie
> >
>
> Our current XMI implementation and what is on the
> table for OASIS are two different things.  Let me
> quote from the standard proposal:
>
> <quote>
> Currently the Apache UIMA Component Metadata Descriptor includes the
> following elements that are not part of the proposed UIMA Specification.
>
> 1. Indexes: Defines the structure of indexes through which the analytic
> will access data. In some sense the actual indexing design is an Apache
> UIMA issue and so this may be an extension to the descriptor schema that
> is specific to Apache UIMA. However if we think of the index definitions
> as a component declaring the key features that it is going to use to
> query the data, we can make a case that this should be a UIMA standard,
> so that any framework could optimize based on this information.
>
> 2. Type Priorities: These are closely related to the index definitions
> and should probably be combined with them rather than represented as a
> separate element.
>
> </quote>
>
> Maybe I'm wrong, but I think this has consequences for Apache
> UIMA flows that use OASIS compliant services, as indexing
> information is lost.  In Apache UIMA, you explicitly need to
> add FSs to indexes (or not).  This distinction is lost if
> indexes are not part of the spec.
>
> --Thilo
>

We are talking about two different things. Section 5.3.4.2 of the spec
describes how views are to be encoded in the XMI representation of the CAS.
The list of view members is exactly the same information as that in the
earlier XCAS format, indicating which FS have been added to the index
repository for each view. This data preserves the same indexing information
currently in the Vinci and SOAP service interfaces as well as in the JNI
interface for C++ annotators.
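
For concreteness, the membership information in question is simply
"which FS were added to which view's indexes", i.e. what the following
sketch establishes (the second view name is made up):

    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.text.AnnotationFS;

    // Sketch: index membership is recorded per view; it is this per-view
    // membership list that gets carried for each view when the CAS is sent.
    public class ViewMembershipSketch {
        public static void addToTwoViews(CAS cas) {
            AnnotationFS a = cas.createAnnotation(cas.getAnnotationType(), 0, 4);
            cas.addFsToIndexes(a);                     // indexed in this view

            CAS other = cas.createView("secondView");  // hypothetical view name
            other.setDocumentText("more text");
            AnnotationFS b = other.createAnnotation(other.getAnnotationType(), 0, 4);
            other.addFsToIndexes(b);                   // indexed only in "secondView"
        }
    }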

The second issue you raise is how indexes themselves (and the related type
priorities) are to be covered by the spec, specifically in the component
metadata descriptors. Index definitions are not in the spec, but they are
not needed to guarantee that the indexing information in Apache UIMA is
preserved after calling XMI services.

Eddie

Re: New CAS heap impl?

Posted by Thilo Goetz <tw...@gmx.de>.
Eddie Epstein wrote:
[...]
> With regard to indexing information in the CAS XMI format, what is passed
> is a list of FS that are indexed [in each view]. Today, without delta CAS,
> the indexes are fully rebuilt when the CAS is returned. All sorted indexes
> will retain the same iteration order. Non-sorted indexes may have a
> different order, but that has always been documented: "A bag index simply
> stores everything, without any guaranteed order."
> 
> The only potential change in behavior that I am aware of has to do with
> adding an FS to the index repository multiple times: "... all FSs that are
> committed are entered, even if they are duplicates of already existing
> FSs." So yes, that would be a change in behavior, as there would only be a
> single instance of each FS in the index upon return from a remote
> component. Is this the difference you were referring to, or is there more?
> 
> Eddie
> 

Our current XMI implementation and what is on the
table for OASIS are two different things.  Let me
quote from the standard proposal:

<quote>
Currently the Apache UIMA Component Metadata Descriptor includes the following
elements that are not part of the proposed UIMA Specification.

1. Indexes: Defines the structure of indexes through which the analytic will access
data. In some sense the actual indexing design is an Apache UIMA issue and so
this may be an extension to the descriptor schema that is specific to Apache
UIMA. However if we think of the index definitions as a component declaring
the key features that it is going to use to query the data, we can make a case that
this should be a UIMA standard, so that any framework could optimize based on
this information.

2. Type Priorities: These are closely related to the index definitions and should
probably be combined with them rather than represented as a separate element

</quote>

Maybe I'm wrong, but I think this has consequences for Apache
UIMA flows that use OASIS compliant services, as indexing
information is lost.  In Apache UIMA, you explicitly need to
add FSs to indexes (or not).  This distinction is lost if
indexes are not part of the spec.

--Thilo

Re: New CAS heap impl?

Posted by Eddie Epstein <ea...@gmail.com>.
Hi Thilo,

On 10/29/07, Thilo Goetz <tw...@gmx.de> wrote:
>
> Hi Eddie,
>
> Eddie Epstein wrote:
> >> It doesn't seem intuitive to me that an object reference whose
> >> underlying object may have been serialized, sent over the network
> >> or to C++, modified, serialized again and sent back is guaranteed
> >> to still be valid afterwards.  It makes sense that this should
> >> work when all annotators are local.  I don't think it makes sense
> >> to guarantee this behavior in general.
> >>
> > The fact that this works for services and C++ annotators is not by
> > accident, it is because a lot of effort was put in to make it work.
>
> I know, I wrote the first version of that code (together with Oli
> Suhre).
>
> > At issue here is the vision for UIMA with regard to how much flexibility
> > to have in deploying annotators without affecting application behavior.
> >
> > A strong point for UIMA, particularly with the OASIS standards
> > work, is that UIMA annotators can be externalized and implemented
> > in any language. It would be nice if the Apache UIMA implementation
> > would not penalize applications for using those annotators.
> >
> > Eddie
> >
>
> I can see that this point is very important to you.  I would
> have thought that the original point we were debating was pretty
> minor, and with proper documentation, should cause no problems
> for anyone.  However, I understand you see things differently.
>
> It will be interesting to see what repercussions the OASIS
> standard has on such issues.  For example, indexing as we use
> it today in Apache UIMA is not part of the standard atm.  So indexing
> information is lost in translation.  This means that potentially,
> when a flow includes a call to an OASIS-compatible annotator, indexing
> info and thus annotation iteration will change.  Now maybe we
> will want to change the way indexing works in Apache UIMA in
> response to this, but I don't see how we can do this while staying
> backward compatible.  I'd be interested to know what your take
> is on this issue, as you're one of the authors of the initial OASIS
> submission.  (Not to mention type priorities, but I'll be glad to
> see them go ;-)
>
> --Thilo
>
>
Well, it should be expected that such a change, reimplementing FS
storage, would have more ramifications than what is immediately obvious.
And yes, having spent much time now working on flexible and scalable
deployment options for UIMA annotators, I am quite keen on having
consistent behavior for co-located and remote configurations.

With regard to indexing information in the CAS XMI format, what is passed
is a list of FS that are indexed [in each view]. Today, without delta CAS,
the indexes are fully rebuilt when the CAS is returned. All sorted indexes
will retain the same iteration order. Non-sorted indexes may have a
different order, but that has always been documented: "A bag index simply
stores everything, without any guaranteed order."

The only potential change in behavior that I am aware of has to do with
adding an FS to the index repository multiple times: "... all FSs that are
committed are entered, even if they are duplicates of already existing
FSs." So yes, that would be a change in behavior, as there would only be a
single instance of each FS in the index upon return from a remote
component. Is this the difference you were referring to, or is there more?

Eddie

Re: New CAS heap impl?

Posted by Thilo Goetz <tw...@gmx.de>.
Hi Eddie,

Eddie Epstein wrote:
>> It doesn't seem intuitive to me that an object reference whose
>> underlying object may have been serialized, sent over the network
>> or to C++, modified, serialized again and sent back is guaranteed
>> to still be valid afterwards.  It makes sense that this should
>> work when all annotators are local.  I don't think it makes sense
>> to guarantee this behavior in general.
>>
> The fact that this works for services and C++ annotators is not by
> accident, it is because a lot of effort was put in to make it work.

I know, I wrote the first version of that code (together with Oli
Suhre).

> At issue here is the vision for UIMA with regard to how much flexibility
> to have in deploying annotators without affecting application behavior.
> 
> A strong point for UIMA, particularly with the OASIS standards
> work, is that UIMA annotators can be externalized and implemented
> in any language. It would be nice if the Apache UIMA implementation
> would not penalize applications for using those annotators.
> 
> Eddie
> 

I can see that this point is very important to you.  I would
have thought that the original point we were debating was pretty
minor, and with proper documentation, should cause no problems
for anyone.  However, I understand you see things differently.

It will be interesting to see what repercussions the OASIS
standard has on such issues.  For example, indexing as we use
it today in Apache UIMA is not part of the standard atm.  So indexing
information is lost in translation.  This means that potentially,
when a flow includes a call to an OASIS-compatible annotator, indexing
info and thus annotation iteration will change.  Now maybe we
will want to change the way indexing works in Apache UIMA in
response to this, but I don't see how we can do this while staying
backward compatible.  I'd be interested to know what your take
is on this issue, as you're one of the authors of the initial OASIS
submission.  (Not to mention type priorities, but I'll be glad to
see them go ;-)

--Thilo


Re: New CAS heap impl?

Posted by Eddie Epstein <ea...@gmail.com>.
>
> It doesn't seem intuitive to me that an object reference whose
> underlying object may have been serialized, sent over the network
> or to C++, modified, serialized again and sent back is guaranteed
> to still be valid afterwards.  It makes sense that this should
> work when all annotators are local.  I don't think it makes sense
> to guarantee this behavior in general.
>
The fact that this works for services and C++ annotators is not by
accident, it is because a lot of effort was put in to make it work.
At issue here is the vision for UIMA with regard to how much flexibility
to have in deploying annotators without affecting application behavior.

A strong point for UIMA, particularly with the OASIS standards
work, is that UIMA annotators can be externalized and implemented
in any language. It would be nice if the Apache UIMA implementation
would not penalize applications for using those annotators.

Eddie

Re: New CAS heap impl?

Posted by Thilo Goetz <tw...@gmx.de>.
Adam Lally wrote:
> On 10/22/07, Thilo Goetz <tw...@gmx.de> wrote:
>> Eddie Epstein wrote:
>>> Consider the following code:
>>>         AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier);
>>>         CAS cas = ae.newCAS();
>>>         cas.setDocumentText("some text");
>>>         AnnotationFS fs = cas.createAnnotation(cas.getAnnotationType(), 0, 4);
>>>         ae.process(cas);
>>>         System.out.println(fs.getCoveredText());
>>>
>>> Preexisting fs in the client must be valid after a process call, no?
>> No.  I've been over this with Adam on one of the OASIS calls, too.
> 
> I sort of remember discussing this but can't remember ever coming to a
> conclusion.
> 
> It seems to me that this code ought to work.  To me it makes sense
> that we say that an _annotator_ shouldn't retain references to FS
> across process calls, since the CAS represents a different document
> each time.  But I think it's entirely another thing to say that an
> application can't keep references to FS when it calls an AE's process
> method.  It doesn't seem very intuitive to me that the object
> reference would be invalidated.  In any case I don't think we
> sufficiently warn our users against this, and since it "happens to
> work" they may not be happy if we change it.
> 
> If we're going to vote on it, maybe we solicit user opinions too.
> 
> -Adam

It doesn't seem intuitive to me that an object reference whose
underlying object may have been serialized, sent over the network
or to C++, modified, serialized again and sent back is guaranteed
to still be valid afterwards.  It makes sense that this should
work when all annotators are local.  I don't think it makes sense
to guarantee this behavior in general.

The window of opportunity where I could have worked on this is
closing rapidly.  Still, it might make sense to come to a conclusion
in this matter in case it comes up again.

--Thilo




Re: New CAS heap impl?

Posted by Adam Lally <al...@alum.rpi.edu>.
On 10/22/07, Thilo Goetz <tw...@gmx.de> wrote:
> Eddie Epstein wrote:
> > Consider the following code:
> >         AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier);
> >         CAS cas = ae.newCAS();
> >         cas.setDocumentText("some text");
> >         AnnotationFS fs = cas.createAnnotation(cas.getAnnotationType(), 0, 4);
> >         ae.process(cas);
> >         System.out.println(fs.getCoveredText());
> >
> > Preexisting fs in the client must be valid after a process call, no?
>
> No.  I've been over this with Adam on one of the OASIS calls, too.

I sort of remember discussing this but can't remember ever coming to a
conclusion.

It seems to me that this code ought to work.  To me it makes sense
that we say that an _annotator_ shouldn't retain references to FS
across process calls, since the CAS represents a different document
each time.  But I think it's entirely another thing to say that an
application can't keep references to FS when it calls an AE's process
method.  It doesn't seem very intuitive to me that the object
reference would be invalidated.  In any case I don't think we
sufficiently warn our users against this, and since it "happens to
work" they may not be happy if we change it.

If we're going to vote on it, maybe we solicit user opinions too.

-Adam

Re: New CAS heap impl?

Posted by Eddie Epstein <ea...@gmail.com>.
>
> >
> > Doing this in the serialization code will not work. There is no way for
> > this to efficiently detect which existing FS have had feature values
> > changed. More importantly, it eliminates the ability to track CAS
> > changes for colocated annotators, something that has been repeatedly
> > asked for to improve debugging and to track provenance.
>
> Now wait a minute.  The current heap implementation can't
> do that either.  All we were talking about was to know which
> FSs were *added* since the CAS was serialized.  That is
> something you can do now by remembering the top heap position,
> and I am planning to support this with the new heap impl as
> well.  Knowing what FSs were *modified* is an entirely different
> proposition.


Right, recording the fact that old FS have been modified will require
changes. The ability to recognize old FS quickly is key, thanks. I was
mainly commenting that serialization was not a good place to do this stuff.


>
> >>> Given no warning against doing this from an application, the fact that it
> >>> works and that it is fairly intuitive to do so means that there are likely
> >>> existing UIMA applications doing it. Of course we all are willing to break
> >>> existing user code when it gets in the way of some neat improvement :)
> >> So you agree that maintaining this behavior is not a requirement?
> >
> >
> > No, not without further discussion.
>
> Maybe we should call for a vote?


Sure. What exactly are we voting for, breaking this just for remote annotators,
or for all annotators?


>
> >>> Blob serialization, like the binary serialization used between C++ and
> >>> Java, leaves the Java CAS with a string heap rather than a string list.
> >>> It would be easy to change blob deserialization to recreate a string
> >>> list instead, and measure the performance difference.
> >> I'll take your word for it, though I still don't see what this
> >> has to do with what we were talking about.  In the new heap I'm
> >> thinking about, there will be no such thing as a String heap or
> >> list.  Strings will just be referenced directly from the objects
> >> representing FSs.
> >>
> >
> > It sounds like you have no concern for binary serialization performance.
>
> I don't know what makes you say that.  That is not the
> impression I wanted to give, at least ;-)  I'll admit
> it's not my primary concern.  To repeat: I simply do not
> understand what you mean to show by your string heap vs.
> string list test.  I'm not unwilling, just intellectually
> incapable.


My concern is that deserializing FS into a single int array is much faster
than creating individual Java objects for each FS; same for strings, so
doing a simple experiment with strings would be relevant. Maybe I am
completely confused?
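
The general shape of such an experiment (here for the int-record vs
object case; a string variant would be analogous) might look like the
crude sketch below: a single-run comparison with no proper warm-up or
statistics, so the numbers would only be indicative and will vary by JVM.

    // Crude comparison: bulk writes into one int array vs creating one small
    // object per FS.  Not a rigorous benchmark; treat results as indicative.
    public class FsCreationMicrobench {
        static class SmallFs { int typeCode, begin, end; }  // hypothetical FS stand-in

        public static void main(String[] args) {
            final int n = 1000000;

            long t0 = System.nanoTime();
            int[] heap = new int[n * 3];
            for (int i = 0; i < n; i++) {
                heap[i * 3] = 1; heap[i * 3 + 1] = i; heap[i * 3 + 2] = i + 4;
            }
            long t1 = System.nanoTime();

            SmallFs[] objs = new SmallFs[n];
            for (int i = 0; i < n; i++) {
                SmallFs fs = new SmallFs();
                fs.typeCode = 1; fs.begin = i; fs.end = i + 4;
                objs[i] = fs;
            }
            long t2 = System.nanoTime();

            System.out.println("int heap: " + (t1 - t0) / 1000000 + " ms, "
                    + "objects: " + (t2 - t1) / 1000000 + " ms");
        }
    }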


> > Changing the heap design to enable garbage collection at the expense of
> > seriously degrading performance for existing users that are strongly
> > dependent on efficient CAS serialization does not sound viable.
>
> I agree completely.  If this turns out to seriously degrade
> performance for *any* important scenario, it's out.  However,
> I'm not sure it will degrade performance, not even for binary
> serialization.  Otherwise I wouldn't be suggesting this.
>

Oh good, my worries are over :)

Eddie

Re: New CAS heap impl?

Posted by Thilo Goetz <tw...@gmx.de>.
Eddie Epstein wrote:
>>> Copying the behavior would be appropriate, unless there is some other way
>>> to easily distinguish pre-existing FS.
>> To my mind, the place to keep track of something like that
>> is the serialization code.  It has to iterate over the whole
>> CAS anyway and can do that kind of tracking.  It seems wrong
>> to put that kind of requirement on the heap implementation.
> 
> 
> Doing this in the serialization code will not work. There is no way for this
> to efficiently detect which existing FS have had feature values changed.
> More importantly, it eliminates the ability to track CAS changes for
> colocated annotators, something that has been repeatedly asked for to improve
> debugging and to track provenance.

Now wait a minute.  The current heap implementation can't
do that either.  All we were talking about was to know which
FSs were *added* since the CAS was serialized.  That is
something you can do now by remembering the top heap position,
and I am planning to support this with the new heap impl as
well.  Knowing what FSs were *modified* is an entirely different
proposition.

> 
>>> Given no warning against doing this from an application, the fact that it
>>> works and that it is fairly intuitive to do so means that there are likely
>>> existing UIMA applications doing it. Of course we all are willing to break
>>> existing user code when it gets in the way of some neat improvement :)
>> So you agree that maintaining this behavior is not a requirement?
> 
> 
> No, not without further discussion.

Maybe we should call for a vote?

> 
>>> Blob serialization, like the binary serialization used between C++ and
>>> Java, leaves the Java CAS with a string heap rather than a string list.
>>> It would be easy to change blob deserialization to recreate a string
>>> list instead, and measure the performance difference.
>> I'll take your word for it, though I still don't see what this
>> has to do with what we were talking about.  In the new heap I'm
>> thinking about, there will be no such thing as a String heap or
>> list.  Strings will just be referenced directly from the objects
>> representing FSs.
>>
> 
> It sounds like you have no concern for binary serialization performance.

I don't know what makes you say that.  That is not the
impression I wanted to give, at least ;-)  I'll admit
it's not my primary concern.  To repeat: I simply do not
understand what you mean to show by your string heap vs.
string list test.  I'm not unwilling, just intellectually
incapable.

> Changing the heap design to enable garbage collection at the expense of
> seriously degrading performance for existing users that are strongly
> dependent on efficient CAS serialization does not sound viable.

I agree completely.  If this turns out to seriously degrade
performance for *any* important scenario, it's out.  However,
I'm not sure it will degrade performance, not even for binary
serialization.  Otherwise I wouldn't be suggesting this.

--Thilo

> 
> How about re-implementing the heap as a pluggable component so that the
> existing design would still be available?
> 
> Eddie
> 

Re: New CAS heap impl?

Posted by Eddie Epstein <ea...@gmail.com>.
>
> > Copying the behavior would be appropriate, unless there is some other
> > way to easily distinguish pre-existing FS.
>
> To my mind, the place to keep track of something like that
> is the serialization code.  It has to iterate over the whole
> CAS anyway and can do that kind of tracking.  It seems wrong
> to put that kind of requirement on the heap implementation.


Doing this in the serialization code will not work. There is no way for this
to efficiently detect which existing FS have had feature values changed.
More importantly, it eliminates the ability to track CAS changes for
colocated annotators, something that has been repeatedly asked for to improve
debugging and to track provenance.

> > Given no warning against doing this from an application, the fact that it
> > works and that it is fairly intuitive to do so means that there are likely
> > existing UIMA applications doing it. Of course we all are willing to break
> > existing user code when it gets in the way of some neat improvement :)
>
> So you agree that maintaining this behavior is not a requirement?


No, not without further discussion.

> > Blob serialization, like the binary serialization used between C++ and
> > Java, leaves the Java CAS with a string heap rather than a string list.
> > It would be easy to change blob deserialization to recreate a string
> > list instead, and measure the performance difference.
>
> I'll take your word for it, though I still don't see what this
> has to do with what we were talking about.  In the new heap I'm
> thinking about, there will be no such thing as a String heap or
> list.  Strings will just be referenced directly from the objects
> representing FSs.
>

It sounds like you have no concern for binary serialization performance.
Changing the heap design to enable garbage collection at the expense of
seriously degrading performance for existing users that are strongly
dependent on efficient CAS serialization does not sound viable.

How about re-implementing the heap as a pluggable component so that the
existing design would still be available?

Eddie

Re: New CAS heap impl?

Posted by Thilo Goetz <tw...@gmx.de>.
Eddie Epstein wrote:
[...]
>>> With the current design, the top of the FS heap position on calling
>>> process is used to identify new versus preexisting FS during or after
>>> the call: just compare any FS address to that position to know if it is
>>> new or not.
>> I can copy this behavior in the new implementation, but
>> do we really want to rely on this and make it part of the
>> design of the CAS and its heap?  Currently, this is a property
>> of the implementation, but not something I ever considered
>> to be part of the external contract of the CAS implementation.
>>
>> It only works because the heap doesn't do any garbage collection,
>> and consequently no heap compaction.  It's not like that because
>> I thought that was a particularly good idea, but simply because
>> it would have been difficult to implement.  So it's a restriction
>> of the implementation, and not necessarily something to preserve
>> in the future.
> 
> 
> Copying the behavior would be appropriate, unless there is some other way to
> easily distinguish pre-existing FS.

To my mind, the place to keep track of something like that
is the serialization code.  It has to iterate over the whole
CAS anyway and can do that kind of tracking.  It seems wrong
to put that kind of requirement on the heap implementation.

With the new kind of implementation I have in mind, this
information will still be available.  For future development,
it would be better not to rely on heap implementation details.

> 
> 
>>> Consider the following code:
>>>         AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier);
>>>         CAS cas = ae.newCAS();
>>>         cas.setDocumentText("some text");
>>>         AnnotationFS fs = cas.createAnnotation(cas.getAnnotationType(), 0, 4);
>>>         ae.process(cas);
>>>         System.out.println(fs.getCoveredText());
>>>
>>> Preexisting fs in the client must be valid after a process call, no?
>> No.  I've been over this with Adam on one of the OASIS calls, too.
>> It happens to work in the current implementation, but nowhere do
>> we guarantee this or suggest that this should work.  To the contrary,
>> we always tell people not to keep FS references across process calls.
>> The design I am planning on may break this code.  I will guarantee
>> that int IDs of FSs are constant for serialization/deserialization,
>> but I won't necessarily keep the objects around.  So if the CAS was
>> sent over the wire, the object may no longer be valid.  If the
>> deployment is all local, it will continue to work (unless the FS
>> has been deleted by one of the annotators).
> 
> 
> Changing behavior for remote versus colocated annotators is not a good idea.
> As for telling people not to keep application references, the only
> documentation we have for that [that I have seen] has to do with code inside
> an annotator process method. Specifically:
> 
> The JCas will be cleared between calls to your annotator's process()
> method. All of the analysis results related to the previous document will
> be deleted to make way for analysis of a new document. Therefore, you
> should never save a reference to a JCas Feature Structure object (i.e. an
> instance of a class created using JCasGen) and attempt to reuse it in a
> future invocation of the process() method. If you do so, the results will
> be undefined.
> 
> Given no warning against doing this from an application, the fact that it
> works and that it is fairly intuitive to do so means that there are likely
> existing UIMA applications doing it. Of course we all are willing to break
> existing user code when it gets in the way of some neat improvement :)

So you agree that maintaining this behavior is not a requirement?

> 
> It was the second paragraph that I didn't understand.
> 
> Blob serialization, like the binary serialization used between C++ and Java,
> leaves the Java Cas with a string heap rather than a string list. It would
> be easy to change blob deserialization to recreate a string list instead,
> and measure the performance difference.

I'll take your word for it, though I still don't see what this
has to do with what we were talking about.  In the new heap I'm
thinking about, there will be no such thing as a String heap or
list.  Strings will just be referenced directly from the objects
representing FSs.

--Thilo

> 
> Eddie
> 


Re: New CAS heap impl?

Posted by Eddie Epstein <ea...@gmail.com>.
>
> > With the current design, the top of the FS heap position on calling
> > process is used to identify new versus preexisting FS during or after
> > the call: just compare any FS address to that position to know if it is
> > new or not.
>
> I can copy this behavior in the new implementation, but
> do we really want to rely on this and make it part of the
> design of the CAS and its heap?  Currently, this is a property
> of the implementation, but not something I ever considered
> to be part of the external contract of the CAS implementation.
>
> It only works because the heap doesn't do any garbage collection,
> and consequently no heap compaction.  It's not like that because
> I thought that was a particularly good idea, but simply because
> it would have been difficult to implement.  So it's a restriction
> of the implementation, and not necessarily something to preserve
> in the future.


Copying the behavior would be appropriate, unless there is some other way to
easily distinguish pre-existing FS.


>
> > Consider the following code:
> >         AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier);
> >         CAS cas = ae.newCAS();
> >         cas.setDocumentText("some text");
> >         AnnotationFS fs = cas.createAnnotation(cas.getAnnotationType(), 0, 4);
> >         ae.process(cas);
> >         System.out.println(fs.getCoveredText());
> >
> > Preexisting fs in the client must be valid after a process call, no?
>
> No.  I've been over this with Adam on one of the OASIS calls, too.
> It happens to work in the current implementation, but nowhere do
> we guarantee this or suggest that this should work.  To the contrary,
> we always tell people not to keep FS references across process calls.
> The design I am planning on may break this code.  I will guarantee
> that int IDs of FSs are constant for serialization/deserialization,
> but I won't necessarily keep the objects around.  So if the CAS was
> sent over the wire, the object may no longer be valid.  If the
> deployment is all local, it will continue to work (unless the FS
> has been deleted by one of the annotators).


Changing behavior for remote versus colocated annotators is not a good idea.
As for telling people not to keep application references, the only
documentation we have for that [that I have seen] has to do with code inside
an annotator process method. Specifically:

The JCas will be cleared between calls to your annotator's process() method.
All of the analysis results related to the previous document will be deleted
to make way for analysis of a new document. Therefore, you should never save
a reference to a JCas Feature Structure object (i.e. an instance of a class
created using JCasGen) and attempt to reuse it in a future invocation of the
process() method. If you do so, the results will be undefined.

Given no warning against doing this from an application, the fact that it
works and that it is fairly intuitive to do so means that there are likely
existing UIMA applications doing it. Of course we all are willing to break
existing user code when it gets in the way of some neat improvement :)
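
For reference, the pattern that quoted passage warns against looks
roughly like this inside an annotator (hypothetical example, names made
up):

    import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.tcas.Annotation;

    // Hypothetical annotator showing the documented anti-pattern: it caches
    // an FS from one document and touches it again on the next document.
    public class LeakyAnnotator extends JCasAnnotator_ImplBase {
        private Annotation lastSeen;  // survives across process() calls

        @Override
        public void process(JCas jcas) {
            if (lastSeen != null) {
                // Undefined results: lastSeen belonged to the previous document.
                System.out.println(lastSeen.getCoveredText());
            }
            int end = Math.min(4, jcas.getDocumentText().length());
            lastSeen = new Annotation(jcas, 0, end);
            lastSeen.addToIndexes();
        }
    }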

> It was the second paragraph that I didn't understand.
>

Blob serialization, like the binary serialization used between C++ and Java,
leaves the Java Cas with a string heap rather than a string list. It would
be easy to change blob deserialization to recreate a string list instead,
and measure the performance difference.

Eddie

Re: New CAS heap impl?

Posted by Thilo Goetz <tw...@gmx.de>.
Eddie Epstein wrote:
> On 10/19/07, Thilo Goetz <tw...@gmx.de> wrote:
>>> As far as I know, the main requirements for delta CAS are that it is easy
>>> (i.e. cheap) to know,
>>>  1. which FS were created in the current call
>>>  2. which preexisting FS were deleted from the index
>>>  3. when setting a feature value, if the containing FS was preexisting
>> None of these are particularly easy to do now, and they
>> won't be any easier or harder when I'm done ;-)  As I said,
>> there will still be unique IDs, and as long as you don't
>> refer to the heap directly, my changes should not affect
>> this design.
> 
> 
> With the current design, the top of the FS heap position on calling process
> is used to identify new versus preexisting FS during or after the call: just
> compare any FS address to that position to know if it is new or not.

I can copy this behavior in the new implementation, but
do we really want to rely on this and make it part of the
design of the CAS and its heap?  Currently, this is a property
of the implementation, but not something I ever considered
to be part of the external contract of the CAS implementation.

It only works because the heap doesn't do any garbage collection,
and consequently no heap compaction.  It's not like that because
I thought that was a particularly good idea, but simply because
it would have been difficult to implement.  So it's a restriction
of the implementation, and not necessarily something to preserve
in the future.

> 
>>> Another thing to keep in mind for calls to remote services is the
>>> requirement that any FS references in the client are still valid after
>>> making a call.
>>>
>>> As for impact on binary serialization performance, an easy experiment
>>> would be to modify binary serialization to end up with a string list
>>> instead of a string heap, using a scenario that had a lot of strings in
>>> the CAS. This would give a good idea of the extra overhead of creating
>>> individual FS objects.
>> I must admit that I don't understand what you mean.
>>
>> For both paragraphs?
> 
> Consider the following code:
>         AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier);
>         CAS cas = ae.newCAS();
>         cas.setDocumentText("some text");
>         AnnotationFS fs = cas.createAnnotation(cas.getAnnotationType(), 0, 4);
>         ae.process(cas);
>         System.out.println(fs.getCoveredText());
> 
> Preexisting fs in the client must be valid after a process call, no?

No.  I've been over this with Adam on one of the OASIS calls, too.
It happens to work in the current implementation, but nowhere do
we guarantee this or suggest that this should work.  To the contrary,
we always tell people not to keep FS references across process calls.
The design I am planning on may break this code.  I will guarantee
that int IDs of FSs are constant for serialization/deserialization,
but I won't necessarily keep the objects around.  So if the CAS was
sent over the wire, the object may no longer be valid.  If the
deployment is all local, it will continue to work (unless the FS
has been deleted by one of the annotators).

> 
> For the 2nd paragraph, I was referring to binary blob serialization.
> 
> Eddie
> 

It was the second paragraph that I didn't understand.

--Thilo

Re: New CAS heap impl?

Posted by Eddie Epstein <ea...@gmail.com>.
On 10/19/07, Thilo Goetz <tw...@gmx.de> wrote:
>
> >
> > As far as I know, the main requirements for delta CAS are that it is easy
> > (i.e. cheap) to know,
> >  1. which FS were created in the current call
> >  2. which preexisting FS were deleted from the index
> >  3. when setting a feature value, if the containing FS was preexisting
>
> None of these are particularly easy to do now, and they
> won't be any easier or harder when I'm done ;-)  As I said,
> there will still be unique IDs, and as long as you don't
> refer to the heap directly, my changes should not affect
> this design.


With the current design, the top of the FS heap position on calling process
is used to identify new versus preexisting FS during or after the call: just
compare any FS address to that position to know if it is new or not.
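
In code, the check being described is roughly this (a sketch: it leans
on the low-level CAS API, and how the heap mark gets recorded before the
process call is left open here):

    import org.apache.uima.cas.FeatureStructure;
    import org.apache.uima.cas.impl.LowLevelCAS;

    // Sketch of the address-comparison trick.  Assumes the framework recorded
    // the top-of-heap address just before calling process(); capturing that
    // mark is not shown.
    public class NewFsCheckSketch {
        public static boolean isNew(LowLevelCAS llCas, FeatureStructure fs,
                int heapMarkAtProcess) {
            int addr = llCas.ll_getFSRef(fs);  // int address of this FS on the heap
            return addr >= heapMarkAtProcess;  // allocated after the mark => new
        }
    }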

>
> > Another thing to keep in mind for calls to remote services is the
> > requirement that any FS references in the client are still valid after
> > making a call.
> >
> > As for impact on binary serialization performance, an easy experiment
> > would be to modify binary serialization to end up with a string list
> > instead of a string heap, using a scenario that had a lot of strings in
> > the CAS. This would give a good idea of the extra overhead of creating
> > individual FS objects.
>
> I must admit that I don't understand what you mean.
>
> For both paragraphs?

Consider the following code:
        AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier);
        CAS cas = ae.newCAS();
        cas.setDocumentText("some text");
        AnnotationFS fs = cas.createAnnotation(cas.getAnnotationType(), 0, 4);
        ae.process(cas);
        System.out.println(fs.getCoveredText());

Preexisting fs in the client must be valid after a process call, no?

For the 2nd paragraph, I was referring to binary blob serialization.

Eddie

Re: New CAS heap impl?

Posted by Thilo Goetz <tw...@gmx.de>.
Eddie Epstein wrote:
> On 10/18/07, Thilo Goetz <tw...@gmx.de> wrote:
>>> Perhaps "requirements" was the wrong word here; "opportunities for
>>> improvement" would be better. The envisioned implementations have no
>>> dramatic impact on existing design. But as long as we are discussing,
>>> any comments?
>>>
>>> Eddie
>>>
>> If by "existing design" you don't mean CAS internal
>> implementation details, that's fine with me.  I'm
>> personally not very interested in sending XML serializations
>> of the CAS around the network.  If what you're proposing
>> affects my ability to change the CAS implementation,
>> then I will have some comments.
>>
>> --Thilo
>>
> 
> As far as I know, the main requirements for delta CAS are that it is easy
> (i.e. cheap) to know,
>  1. which FS were created in the current call
>  2. which preexisting FS were deleted from the index
>  3. when setting a feature value, if the containing FS was preexisting

None of these are particularly easy to do now, and they
won't be any easier or harder when I'm done ;-)  As I said,
there will still be unique IDs, and as long as you don't
refer to the heap directly, my changes should not affect
this design.

> 
> Another thing to keep in mind for calls to remote services is the
> requirement that any FS references in the client are still valid after
> making a call.
> 
> As for impact on binary serialization performance, an easy experiment would
> be to modify binary serialization to end up with a string list instead of a
> string heap, using a scenario that had a lot of strings in the CAS. This
> would give a good idea of the extra overhead of creating individual FS
> objects.

I must admit that I don't understand what you mean.

> 
> Eddie
> 

Re: New CAS heap impl?

Posted by Eddie Epstein <ea...@gmail.com>.
On 10/18/07, Thilo Goetz <tw...@gmx.de> wrote:
>
> > Perhaps "requirements" was the wrong word here; "opportunities for
> > improvement" would be better. The envisioned implementations have no
> > dramatic impact on existing design. But as long as we are discussing,
> > any comments?
> >
> > Eddie
> >
>
> If by "existing design" you don't mean CAS internal
> implementation details, that's fine with me.  I'm
> personally not very interested in sending XML serializations
> of the CAS around the network.  If what you're proposing
> affects my ability to change the CAS implementation,
> then I will have some comments.
>
> --Thilo
>

As far as I know, the main requirements for delta CAS are that it is easy
(i.e. cheap) to know,
 1. which FS were created in the current call
 2. which preexisting FS were deleted from the index
 3. when setting a feature value, if the containing FS was preexisting

Another thing to keep in mind for calls to remote services is the
requirement that any FS references in the client are still valid after
making a call.

As for impact on binary serialization performance, an easy experiment would
be to modify binary serialization to end up with a string list instead of a
string heap, using a scenario that had a lot of strings in the CAS. This
would give a good idea of the extra overhead of creating individual FS
objects.

Eddie

Re: New CAS heap impl?

Posted by Thilo Goetz <tw...@gmx.de>.
Eddie Epstein wrote:
> On 10/18/07, Thilo Goetz <tw...@gmx.de> wrote:
>>
>>> merging for parallel processing steps, and will be used to implement a
>>> delta-CAS transport model that only sends out to services the data they
>>> require and only sends back new and modified data. The same design used
>>> for delta CAS will allow us to give users details on CAS changes for
>>> every processing step, when desired for debugging, with little or
>>> essentially no extra overhead. These requirements may be easily handled
>>> in a new CAS design, but we should take them into account in the
>>> redesign process, not after an implementation.
>> Maybe these requirements should be discussed here first
>> before we assume they will be implemented?
>>
>>
> Perhaps "requirements" was the wrong word here; "opportunities for
> improvement" would be better. The envisioned implementations have no
> dramatic impact on existing design. But as long as we are discussing, any
> comments?
> 
> Eddie
> 

If by "existing design" you don't mean CAS internal
implementation details, that's fine with me.  I'm
personally not very interested in sending XML serializations
of the CAS around the network.  If what you're proposing
affects my ability to change the CAS implementation,
then I will have some comments.

--Thilo

Re: New CAS heap impl?

Posted by Eddie Epstein <ea...@gmail.com>.
On 10/18/07, Thilo Goetz <tw...@gmx.de> wrote:
>
>
> > merging for parallel processing steps, and will be used to implement a
> > delta-CAS transport model that only sends out to services the data they
> > require and only sends back new and modified data. The same design used
> > for delta CAS will allow us to give users details on CAS changes for
> > every processing step, when desired for debugging, with little or
> > essentially no extra overhead. These requirements may be easily handled
> > in a new CAS design, but we should take them into account in the
> > redesign process, not after an implementation.
>
> Maybe these requirements should be discussed here first
> before we assume they will be implemented?
>
>
Perhaps "requirements" was the wrong word here; "opportunities for
improvement" would be better. The envisioned implementations have no
dramatic impact on existing design. But as long as we are discussing, any
comments?

Eddie

Re: New CAS heap impl?

Posted by Thilo Goetz <tw...@gmx.de>.
Hi Eddie,

Eddie Epstein wrote:
> Hi Thilo,
> 
> In addition to the impact on binary serialization performance, there will
> also be XMI serialization issues. Heap location is currently used in CAS

Not to worry.  I know I'll need to continue to support the low-level
CAS APIs, and with that, int object IDs.  So XMI serialization and
deserialization should be fine.  The only difference is that
those IDs will no longer represent heap offsets, which will make
serialization to C++ more difficult.
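
One way to keep stable int IDs without heap offsets would be an explicit
registry along these lines (purely a design sketch, not existing UIMA
code; a real version would probably need weak references to keep the GC
benefit):

    import java.util.ArrayList;
    import java.util.IdentityHashMap;
    import org.apache.uima.cas.FeatureStructure;

    // Design sketch: hand out sequential int IDs for FS objects so that
    // serialization and deserialization can still refer to FSs by int,
    // even though the IDs no longer correspond to heap offsets.
    public class FsIdRegistrySketch {
        private final IdentityHashMap<FeatureStructure, Integer> idOf =
                new IdentityHashMap<FeatureStructure, Integer>();
        private final ArrayList<FeatureStructure> byId =
                new ArrayList<FeatureStructure>();

        public int idFor(FeatureStructure fs) {
            Integer id = idOf.get(fs);
            if (id == null) {
                id = Integer.valueOf(byId.size());
                idOf.put(fs, id);
                byId.add(fs);
            }
            return id.intValue();
        }

        public FeatureStructure fsFor(int id) {
            return byId.get(id);
        }
    }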

> merging for parallel processing steps, and will be used to implement a
> delta-CAS transport model that only sends out to services the data they
> require and only sends back new and modified data. The same design used for
> delta CAS will allow us to give users details on CAS changes for every
> processing step, when desired for debugging, with little or essentially no
> extra overhead. These requirements may be easily handled in a new CAS
> design, but we should take them into account in the redesign process, not
> after an implementation.

Maybe these requirements should be discussed here first
before we assume they will be implemented?

> 
> On the other hand, all serialization issues can be ignored in order to do
> some performance testing with a new design. Java object creation may be much
> faster, but there may be other issues. For example, my understanding is that
> there is a fairly significant memory overhead per Java object that may
> increase overall CAS space requirements, at least in some circumstances.

We'll have to see about that.  If we can manage to have JCas
objects *be* those objects, there will be very little overhead.
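
The rough idea would be that a JCas-style cover class stores its feature
values in its own fields instead of reading heap slots, e.g. (hypothetical
sketch, not JCasGen output):

    // Hypothetical sketch of a cover class that *is* the feature structure:
    // begin/end live in the object itself rather than in int-heap slots.
    public class TokenSketch {
        private int begin;
        private int end;

        public TokenSketch(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        public int getBegin() { return begin; }  // plain field read,
        public int getEnd()   { return end; }    // no heap indirection
    }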

> 
> I'd be happy to participate in discussions, but will be unable to contribute
> to coding for at least a couple months.
> 
> Eddie
> 
> On 10/17/07, Thilo Goetz <tw...@gmx.de> wrote:
>> I'm thinking about experimenting with alternative heap
>> implementations in the CAS.  In particular, I would like
>> to try out a heap impl that uses regular Java objects to
>> represent feature structures, as opposed to our proprietary
>> binary heap.
>>
>> Our current heap design was created when object creation
>> in Java was very expensive.  I ran experiments at the time
>> that showed that creating FSs the way we do today was about
>> twice as fast as creating Java objects.  However, there
>> are many reasons to run this experiment again today:
>>
>> * Object creation in Java is a lot faster today.  The speed
>>    advantage may be very much reduced, or even gone
>>    completely.
>>
>> * FS creation is not where a typical annotator spends its
>>    time.  Only for annotators that create a lot of annotations
>>    with little computation effort (such as tokenizers) is this
>>    at all significant.
>>
>> * Our current heap implementation pre-allocates a lot of
>>    memory.  This works relatively well for medium size CASes,
>>    but it has disadvantages both for very small and very
>>    large CASes.  When using Java objects to represent FSs,
>>    we leave the memory allocation to the JVM, which seems
>>    like the right thing to do.
>>
>> * We have no garbage collection on the heap.  FSs that are
>>    once created stay there for the lifetime of the heap.
>>    This is not a problem for most annotators, but there are
>>    situations where this behavior is highly undesirable.
>>    Using Java objects instead, we would benefit from the
>>    garbage collector of the JVM.
>>
>> So here's the rub.  Before I even start with this, I would
>> like to refactor the CAS implementation so I can see what
>> I'm doing.  The CASImpl class has grown organically for many
>> years now, and it's due for a major overhaul.  I will not
>> change any APIs, of course, but I'll probably leave no stone
>> unturned in the implementation.  Any objections to that?
>>
>> Secondly, I will need help with the CAS serialization.  The
>> current binary serialization depends completely on the
>> heap layout.  Eddie, would you have time to work with me
>> on that?  I would like to make the serialization independent
>> of the heap implementation and only rely on the low-level
>> CAS APIs.  That might be a tiny bit slower (which is still
>> to be determined), but it will give us better encapsulation
>> and more flexibility with various heap implementations.
>>
>> Let me know what you think.
>>
>> --Thilo
>>
> 

Re: New CAS heap impl?

Posted by Marshall Schor <ms...@schor.com>.
I recall that an object in Java takes somewhere between 16 and 24 bytes of
overhead, in addition to the data for the object itself, on a 32-bit
Java system, depending on the particular JVM (IBM vs. Sun, 1.4, 5, 6).
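
As a rough back-of-envelope, taking that 16-24 byte figure and assuming
the current layout of one int for the type code plus one int slot per
feature (simplified): an annotation-sized FS with three int features is
4 x 4 = 16 bytes on the current heap, versus roughly 16-24 bytes of
object overhead plus 3 x 4 = 12 bytes of fields, i.e. 28-36 bytes as a
plain object, so on the order of twice the space for small FSs.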

There are some notes on the dev list about the redesign of the hashmap
used by JCas to map FS addresses to objects.

-Marshall

Eddie Epstein wrote:
> Hi Thilo,
>
> In addition to the impact on binary serialization performance, there will
> also be XMI serialization issues. Heap location is currently used in CAS
> merging for parallel processing steps, and will be used to implement a
> delta-CAS transport model that only sends out to services the data they
> require and only sends back new and modified data. The same design used for
> delta CAS will allow us to give users details on CAS changes for every
> processing step, when desired for debugging, with little or essentially no
> extra overhead. These requirements may be easily handled in a new CAS
> design, but we should take them into account in the redesign process, not
> after an implementation.
>
> On the other hand, all serialization issues can be ignored in order to do
> some performance testing with a new design. Java object creation may be much
> faster, but there may be other issues. For example, my understanding is that
> there is a fairly significant memory overhead per Java object that may
> increase overall CAS space requirements, at least in some circumstances.
>
> I'd be happy to participate in discussions, but will be unable to contribute
> to coding for at least a couple months.
>
> Eddie
>
> On 10/17/07, Thilo Goetz <tw...@gmx.de> wrote:
>   
>> I'm thinking about experimenting with alternative heap
>> implementations in the CAS.  In particular, I would like
>> to try out a heap impl that uses regular Java objects to
>> represent feature structures, as opposed to our proprietary
>> binary heap.
>>  <snip/>


Re: New CAS heap impl?

Posted by Eddie Epstein <ea...@gmail.com>.
Hi Thilo,

In addition to the impact on binary serialization performance, there will
also be XMI serialization issues. Heap location is currently used in CAS
merging for parallel processing steps, and will be used to implement a
delta-CAS transport model that only sends out to services the data they
require and only sends back new and modified data. The same design used for
delta CAS will allow us to give users details on CAS changes for every
processing step, when desired for debugging, with little or essentially no
extra overhead. These requirements may be easily handled in a new CAS
design, but we should take them into account in the redesign process, not
after an implementation.
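
For what it's worth, stable FS identity does not have to come from heap
addresses. A purely illustrative sketch (hypothetical names, not a proposal
for the actual delta-CAS design) would be to hand out sequential ids at FS
creation time and log changes against those ids, so the transport layer can
ship only new and modified structures:

    // Sketch only: stable ids plus a modification log, independent of heap layout.
    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    final class FsIdTracker {
        private int nextId = 1;                        // 0 reserved for "no FS"
        private final Set<Integer> created = new LinkedHashSet<>();
        private final Set<Integer> modified = new LinkedHashSet<>();

        // Called when a new FS is created; returns its stable id.
        int onCreate() {
            int id = nextId++;
            created.add(id);
            return id;
        }

        // Called whenever a feature of an existing FS is set.
        void onUpdate(int fsId) {
            if (!created.contains(fsId)) {
                modified.add(fsId);
            }
        }

        // The delta to ship back: only new and changed FSs.
        List<Integer> delta() {
            List<Integer> result = new ArrayList<>(created);
            result.addAll(modified);
            return result;
        }
    }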

On the other hand, all serialization issues can be ignored in order to do
some performance testing with a new design. Java object creation may be much
faster, but there may be other issues. For example, my understanding is that
there is a fairly significant memory overhead per Java object that may
increase overall CAS space requirements, at least in some circumstances.

I'd be happy to participate in discussions, but will be unable to contribute
to coding for at least a couple months.

Eddie

On 10/17/07, Thilo Goetz <tw...@gmx.de> wrote:
>
> I'm thinking about experimenting with alternative heap
> implementations in the CAS.  In particular, I would like
> to try out a heap impl that uses regular Java objects to
> represent feature structures, as opposed to our proprietary
> binary heap.
>  <snip/>

Re: New CAS heap impl?

Posted by Thilo Goetz <tw...@gmx.de>.
Adam Lally wrote:
> My only worry (and maybe I'm just being paranoid) is that we have some
> bug fixes piled up ready to release (such as that XMI deserialization
> problem with arrays and JCAS that at least one user did run into) and
> would like to be able to at some point release those fixes without
> also having to release more ambitious changes requiring a lot more
> testing.
> 
> -Adam

That's a reasonable worry.  OTOH, we have no concrete
plans for a release that I'm aware of.  We don't even
have a release manager for the next release.  If I'm
to be constrained by an upcoming release, we should
discuss some time lines here.  If we're having a bug
fix release very soon, I agree it would be better for
me to work in a branch off of trunk.

--Thilo

Re: New CAS heap impl?

Posted by Adam Lally <al...@alum.rpi.edu>.
On 10/18/07, Thilo Goetz <tw...@gmx.de> wrote:
> I was planning to do the refactoring in the trunk.  Once
> I'm in a position where I have encapsulated the heap so
> far as to even be able to experiment, I will do that in
> a private workspace.  I will certainly not switch the
> heap impl without giving all of you a chance to review
> the changes.
>

My only worry (and maybe I'm just being paranoid) is that we have some
bug fixes piled up ready to release (such as that XMI deserialization
problem with arrays and JCAS that at least one user did run into) and
would like to be able to at some point release those fixes without
also having to release more ambitious changes requiring a lot more
testing.

-Adam

Re: New CAS heap impl?

Posted by Thilo Goetz <tw...@gmx.de>.
Marshall Schor wrote:
> Could you outline the refactoring in the trunk prior to doing it - so we
> can have a sense of where you're going, and maybe contribute an idea or two?
> 
> -Marshall

Main goals are: encapsulate the heap so it is only
accessed from the low-level CAS, and make the low-level
CAS impl its own class that's referenced from CASImpl.

The main obstacle to this enterprise will be the roughly
one million places in the code where CASImpl is accessed
directly, instead of through one of its interfaces :-)
I'm probably the main culprit myself, which is one reason
why I want to change it.  If I had done this when I
introduced the low-level CAS, we wouldn't be in this mess
now.
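
A rough shape of what that encapsulation could look like (all names here are
hypothetical and heavily simplified; the real CASImpl and low-level CAS
interfaces carry much more than this), with the heap visible only to the
low-level layer and CASImpl delegating to it:

    // Sketch only: CASImpl -> low-level CAS -> heap, heap hidden from everything above.
    interface Heap {
        int create(int numSlots);            // returns the address of the new FS
        int get(int addr, int offset);
        void set(int addr, int offset, int value);
    }

    final class LowLevelCasSketch {
        private final Heap heap;

        LowLevelCasSketch(Heap heap) {
            this.heap = heap;
        }

        int createFS(int typeCode, int numFeatures) {
            int addr = heap.create(numFeatures + 1);
            heap.set(addr, 0, typeCode);     // slot 0 holds the type code
            return addr;
        }

        int getIntValue(int fsAddr, int featOffset) {
            return heap.get(fsAddr, featOffset);
        }

        void setIntValue(int fsAddr, int featOffset, int value) {
            heap.set(fsAddr, featOffset, value);
        }
    }

    final class CasImplSketch {
        private final LowLevelCasSketch lowLevel;  // never touches the heap directly

        CasImplSketch(LowLevelCasSketch lowLevel) {
            this.lowLevel = lowLevel;
        }

        LowLevelCasSketch getLowLevelCas() {
            return lowLevel;
        }
    }

With that boundary in place, swapping the int-heap implementation for an
object-backed one would only touch the low-level layer.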

--Thilo


Re: New CAS heap impl?

Posted by Marshall Schor <ms...@schor.com>.
Could you outline the refactoring in the trunk prior to doing it - so we
can have a sense of where you're going, and maybe contribute an idea or two?

-Marshall

Thilo Goetz wrote:
> Adam Lally wrote:
> [...]
>   
>> Also what about the logistics of managing the source code - would this
>> work be done in a separate branch?
>>
>> -Adam
>>     
>
> I was planning to do the refactoring in the trunk.  Once
> I'm in a position where I have encapsulated the heap so
> far as to even be able to experiment, I will do that in
> a private workspace.  I will certainly not switch the
> heap impl without giving all of you a chance to review
> the changes.
>
> --Thilo
>
>
>
>   


Re: New CAS heap impl?

Posted by Thilo Goetz <tw...@gmx.de>.
Adam Lally wrote:
[...]
> Also what about the logistics of managing the source code - would this
> work be done in a separate branch?
> 
> -Adam

I was planning to do the refactoring in the trunk.  Once
I'm in a position where I have encapsulated the heap so
far as to even be able to experiment, I will do that in
a private workspace.  I will certainly not switch the
heap impl without giving all of you a chance to review
the changes.

--Thilo


Re: New CAS heap impl?

Posted by Adam Lally <al...@alum.rpi.edu>.
On 10/19/07, Thilo Goetz <tw...@gmx.de> wrote:
> Trust me on this.  I will do nothing that causes
> a major performance degradation for low-level
> annotators.
>
> It's interesting that you all seem to be expecting
> a performance degradation.  I'm really hoping for
> an improvement :-)

Well I did say I was in favor of this experiment, so I also have hopes
that it will work out well. :)

-Adam

Re: New CAS heap impl?

Posted by Thilo Goetz <tw...@gmx.de>.
Trust me on this.  I will do nothing that causes
a major performance degradation for low-level
annotators.

It's interesting that you all seem to be expecting
a performance degradation.  I'm really hoping for
an improvement :-)  I'll be disappointed if an
object-based heap is slower than what we have now,
and maybe then we should not switch.  Let's discuss
performance when we actually have some numbers to
discuss.
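
When the time comes, even a deliberately naive micro-benchmark would give a
first data point. The sketch below (hypothetical classes, no warm-up or JIT
discipline, so the numbers deserve a large grain of salt; a bigger -Xmx may
be needed) just times creating a few million three-slot FSs both ways:

    public final class FsCreationBench {

        static final class IntHeap {
            int[] heap = new int[1 << 20];
            int top = 1;
            int create(int type, int begin, int end) {
                if (top + 3 > heap.length) {
                    heap = java.util.Arrays.copyOf(heap, heap.length * 2);
                }
                int addr = top;
                heap[top++] = type; heap[top++] = begin; heap[top++] = end;
                return addr;
            }
        }

        static final class FsObject {
            final int type, begin, end;
            FsObject(int type, int begin, int end) {
                this.type = type; this.begin = begin; this.end = end;
            }
        }

        public static void main(String[] args) {
            final int n = 2_000_000;

            long t0 = System.nanoTime();
            IntHeap h = new IntHeap();
            int checksum = 0;
            for (int i = 0; i < n; i++) {
                checksum += h.create(7, i, i + 1);
            }
            long t1 = System.nanoTime();

            FsObject[] objs = new FsObject[n];
            for (int i = 0; i < n; i++) {
                objs[i] = new FsObject(7, i, i + 1);
            }
            long t2 = System.nanoTime();

            System.out.printf("int heap: %d ms (checksum %d)%n",
                (t1 - t0) / 1_000_000, checksum);
            System.out.printf("objects:  %d ms (count %d)%n",
                (t2 - t1) / 1_000_000, objs.length);
        }
    }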

--Thilo

Marshall Schor wrote:
> Adam Lally wrote:
>> On 10/17/07, Thilo Goetz <tw...@gmx.de> wrote:
>>   
>>> I'm thinking about experimenting with alternative heap
>>> implementations in the CAS.  In particular, I would like
>>> to try out a heap impl that uses regular Java objects to
>>> represent feature structures, as opposed to our proprietary
>>> binary heap.
>>>  <snip/>
>>>     
>> My two cents:  I'm in favor of experimenting with a new heap
>> implementation.  For co-located deployments Java object overhead
>> should not be an issue at all, since in almost all cases we end up
>> creating a Java object for each FeatureStructure anyway.  
> Except for one -- maybe major -- case: several commercial (and maybe
> research) implementations use the low-level CAS interfaces for
> performance, for components like tokenizers, which have short execution
> paths.   I agree the overhead won't be there if a later annotator then
> uses JCas (or plain CAS, which also creates Java objects when iterating,
> unless the low-level APIs are used) in the co-located case and iterates
> over the tokens.
> 
> -Marshall
>> However for
>> remote services I think it's a different story.  Services may only
>> access some of the objects in the CAS and therefore in the current
>> implementation we never have to create Java objects for many of them.
>> I don't know how significant this is though, since as you said JREs
>> have gotten much better about their object creation overhead and
>> per-object memory footprint.
>>
>> Also what about the logistics of managing the source code - would this
>> work be done in a separate branch?
>>
>> -Adam
>>
>>
>>   

Re: New CAS heap impl?

Posted by Marshall Schor <ms...@schor.com>.
Adam Lally wrote:
> On 10/17/07, Thilo Goetz <tw...@gmx.de> wrote:
>   
>> I'm thinking about experimenting with alternative heap
>> implementations in the CAS.  In particular, I would like
>> to try out a heap impl that uses regular Java objects to
>> represent feature structures, as opposed to our proprietary
>> binary heap.
>>  <snip/>
>>     
>
> My two cents:  I'm in favor of experimenting with a new heap
> implementation.  For co-located deployments Java object overhead
> should not be an issue at all, since in almost all cases we end up
> creating a Java object for each FeatureStructure anyway.  
Except for one -- maybe major -- case: several commercial (and maybe
research) implementations use the low-level CAS interfaces for
performance, for components like tokenizers, which have short execution
paths.   I agree the overhead won't be there if a later annotator then
uses JCas (or plain CAS, which also creates Java objects when iterating,
unless the low-level APIs are used) in the co-located case and iterates
over the tokens.
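
To make the contrast concrete, the two styles look roughly like this (method
names below are recalled from the 2.x javadocs and should be double-checked
against the current LowLevelCAS / LowLevelTypeSystem interfaces before relying
on them):

    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.Feature;
    import org.apache.uima.cas.Type;
    import org.apache.uima.cas.impl.LowLevelCAS;
    import org.apache.uima.cas.text.AnnotationFS;

    final class TokenizerPaths {

        // Standard API: one AnnotationFS (a Java object) per token.
        static void standard(CAS cas, Type tokenType, int[] starts, int[] ends) {
            for (int i = 0; i < starts.length; i++) {
                AnnotationFS token = cas.createAnnotation(tokenType, starts[i], ends[i]);
                cas.addFsToIndexes(token);
            }
        }

        // Low-level API: int addresses only, no per-token Java object in this loop.
        static void lowLevel(CAS cas, Type tokenType, int[] starts, int[] ends) {
            LowLevelCAS ll = cas.getLowLevelCAS();
            int typeCode = ll.ll_getTypeSystem().ll_getCodeForType(tokenType);
            Feature begin = tokenType.getFeatureByBaseName("begin");
            Feature end = tokenType.getFeatureByBaseName("end");
            int beginCode = ll.ll_getTypeSystem().ll_getCodeForFeature(begin);
            int endCode = ll.ll_getTypeSystem().ll_getCodeForFeature(end);
            for (int i = 0; i < starts.length; i++) {
                int fs = ll.ll_createFS(typeCode);
                ll.ll_setIntValue(fs, beginCode, starts[i]);
                ll.ll_setIntValue(fs, endCode, ends[i]);
                ll.ll_getIndexRepository().ll_addFS(fs);
            }
        }
    }

Whatever the new heap looks like, the second loop is the kind of hot path
that should not get slower.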

-Marshall
> However for
> remote services I think it's a different story.  Services may only
> access some of the objects in the CAS and therefore in the current
> implementation we never have to create Java objects for many of them.
> I don't know how significant this is though, since as you said JREs
> have gotten much better about their object creation overhead and
> per-object memory footprint.
>
> Also what about the logistics of managing the source code - would this
> work be done in a separate branch?
>
> -Adam
>
>
>   


Re: New CAS heap impl?

Posted by Adam Lally <al...@alum.rpi.edu>.
On 10/17/07, Thilo Goetz <tw...@gmx.de> wrote:
> I'm thinking about experimenting with alternative heap
> implementations in the CAS.  In particular, I would like
> to try out a heap impl that uses regular Java objects to
> represent feature structures, as opposed to our proprietary
> binary heap.
>  <snip/>

My two cents:  I'm in favor of experimenting with a new heap
implementation.  For co-located deployments Java object overhead
should not be an issue at all, since in almost all cases we end up
creating a Java object for each FeatureStructure anyway.  However for
remote services I think it's a different story.  Services may only
access some of the objects in the CAS and therefore in the current
implementation we never have to create Java objects for many of them.
I don't know how significant this is though, since as you said JREs
have gotten much better about their object creation overhead and
per-object memory footprint.
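
One way to keep that property with object-backed FSs would be to materialize
the Java object for an FS only on first access, keyed by whatever stable id
replaces today's heap address, much like the JCas address-to-object map.
A minimal sketch (hypothetical names):

    import java.util.HashMap;
    import java.util.Map;

    final class LazyFsCache {

        interface FsFactory {
            Object createFor(int fsId);    // builds the Java view of the FS with this id
        }

        private final Map<Integer, Object> byId = new HashMap<>();
        private final FsFactory factory;

        LazyFsCache(FsFactory factory) {
            this.factory = factory;
        }

        // Returns the cached object for this id, creating it only on first use.
        Object get(int fsId) {
            return byId.computeIfAbsent(fsId, factory::createFor);
        }
    }

Untouched FSs in a remotely received CAS would then never pay the object cost,
at the price of a map lookup on access.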

Also what about the logistics of managing the source code - would this
work be done in a separate branch?

-Adam