Posted to dev@uima.apache.org by Marshall Schor <ms...@schor.com> on 2009/02/18 16:57:06 UTC

small memory footprint tradeoff configuration

Some users are beginning to ask for the ability to shift the internal
tradeoffs UIMA takes toward having a smaller memory footprint, at some
cost in performance.

Several areas in particular have come up: 
  1) "interning" string objects, so that only one copy exists
  2) having some way to "compact" or garbage-collect the CAS
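
As a rough illustration of item 1, interning can be as simple as a map
from each string to its first-seen copy. The sketch below is only an
assumption for discussion: StringPool and its methods are hypothetical
names, not part of UIMA, and the map itself costs memory, so whether this
saves space depends on how many duplicate strings there are.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-CAS string pool; not an existing UIMA class. */
final class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the pooled copy of an equal string, adding it on first sight. */
    String intern(String s) {
        if (s == null) {
            return null;
        }
        // putIfAbsent returns the previously pooled copy, or null if s is new.
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    /** Number of distinct strings currently pooled. */
    int size() {
        return pool.size();
    }
}
```

Two equal strings passed through intern() come back as the same object,
so the duplicate copy becomes eligible for ordinary JVM garbage collection.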

Are there other things that should be considered for trade-off here?

-Marshall

Re: small memory footprint tradeoff configuration

Posted by "D.J. McCloskey" <dj...@ie.ibm.com>.

I think this would be a good addition. I don't see a need for an explicit
call to invoke GC. The parameter of a threshold relative to CAS heapsize
would be useful. But also the ability to specify a GC on exiting one
analysis engine and before invoking the next. This could be turned off for
applications where folks don't want the overhead caused in complex
aggregates. Ideally, a combination of the threshold and the AE boundary
auto collect would also be possible by a specific value of the engine
boundary parameter. The idea here would be to only do the collect on the
boundary based on the heap/threshold ratio having been exceeded. This would
give more than enough control and addresses the need.
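
A minimal sketch of how the threshold and the engine boundary parameter
might combine, assuming a three-valued boundary setting. All names here
(CasGcPolicy, BoundaryMode, shouldCollectAtBoundary) are hypothetical and
do not exist in UIMA; heap usage is passed in as plain numbers rather than
read from a real CAS.

```java
/** Hypothetical GC trigger policy; none of these names exist in UIMA. */
final class CasGcPolicy {
    /** How the (hypothetical) engine boundary parameter is interpreted. */
    enum BoundaryMode { OFF, ALWAYS, IF_OVER_THRESHOLD }

    private final BoundaryMode mode;
    private final double thresholdRatio; // fraction of CAS heap capacity, 0.0 to 1.0

    CasGcPolicy(BoundaryMode mode, double thresholdRatio) {
        if (thresholdRatio < 0.0 || thresholdRatio > 1.0) {
            throw new IllegalArgumentException("thresholdRatio must be in [0, 1]");
        }
        this.mode = mode;
        this.thresholdRatio = thresholdRatio;
    }

    /** Called when one analysis engine exits and before the next is invoked. */
    boolean shouldCollectAtBoundary(long usedCells, long capacityCells) {
        switch (mode) {
            case ALWAYS:
                return true;
            case IF_OVER_THRESHOLD:
                // Collect only if the heap/threshold ratio has been exceeded.
                return capacityCells > 0
                        && (double) usedCells / capacityCells > thresholdRatio;
            default: // OFF: no collection at AE boundaries
                return false;
        }
    }
}
```

With OFF as the default, applications with complex aggregates would pay
no extra cost unless they opt in.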

I don't know if it would have to be limited to a capability flow scenario
but it would be nice to also have the option to use the output capabilities
to define what to keep. Or was this what you were thinking?

The notion of type-driven collection is something I'd like to raise also.
Often in analysis there are types which are purely supporting the
identification of some primary type. The ability to declare some types as
transient would perhaps aid the overall objective and mitigate in the
aggressive GC cases. On the aggressive GC point, I believe anything which
invalidates low-level handles is not really acceptable.

-DJ
-------------------
D.J McCloskey
IBM LanguageWare


  From:       Marshall Schor <ms...@schor.com>
  To:         uima-dev@incubator.apache.org
  Date:       12/03/2009 18:06
  Subject:    Re: small memory footprint tradeoff configuration

I agree with both of these concepts:  only GC'ing things which are not
in the index and also not reachable from something that is in the index,
and making GC'ing (mostly) automatic, based on thresholds, etc, when a
component exits back to the framework.  This would be fine for now - if
use cases come up where some more programmatic control of this is
needed, we could add something.

Maybe the next thing to focus on is the "contract" re: GC running.  For
a component (primitive or aggregate), the proposed contract is to have
the GC not change the FS "id"s that existed prior to the component
running.  This is a tradeoff - for more stability with existing handle
uses, versus less "aggressive" GC's.

-Marshall

Thilo Goetz wrote:
> Adam Lally wrote:
>
>> On Wed, Mar 11, 2009 at 8:53 AM, Marshall Schor <ms...@schor.com> wrote:
>>
>>> I agree in general about not making things more complicated at least to
>>> the user.  I can imagine education working for
>>>  1) things like string interning
>>>  2) things like deleting features from type systems where they're not
>>> being used, and where the annotator producing them will respect this.
>>>
>>> What this approach seems to miss are the following kinds of things:
>>>
>>> 1) cases where some set of annotators produce feature structures, which,
>>> after some point, are no longer needed, and are "deleted" but
>>> never-the-less continue to consume space.
>>>
>>> 2) cases where some set of annotators produce feature structures having
>>> lots of fields, where, after some point, the fields are no longer needed.
>>>
>>> If these are not significant use-cases in practice, then I'm happy to
>>> think-about / work-on other things :-).
>>>
>>>
>> I'd like to propose discussing the different ideas here one at a time.
>>  We had enough trouble coming to any agreement on GC the last time
>> that we discussed it, without also throwing string interning and
>> feature deleting into the mix.
>>
>> So focusing on GC first (unless you think one of the others is more important):
>>
>> My inclination is to assure that GC deletes only garbage, and that
>> there's no possibility that anything GC'ed could have been referenced
>> by anybody.  The other proposals that don't have this guarantee are
>> scary to me.
>>
>> A way to accomplish this guarantee would be that when the process
>> method of an AnalysisEngine (could be either primitive or aggregate)
>> completes, we can mark as garbage any FS's that were created since the
>> beginning of that process method, but which are not referenced
>> directly or indirectly from anything in the indexes.  Does this
>> concept seem reasonable?
>>
>
> +1. I like the idea because it is sort of local on the one
> hand, but still allows one to delete FSs from indexes
> later in the processing and have them garbage collected
> (on exiting the containing aggregate).
>
>
>> The next question is under what conditions would a GC execute.
>> Requiring an explicit call seems counter to what other garbage
>> collecting runtime environments do, and like Thilo I'm confused about
>> who would call this and when.  I think it would be better to define
>> the parameters that control GC in the PerformanceTuningSettings that
>> we already have, and make them dependent on how much CAS heap space is
>> used relative to a GC threshold that the user has set in the
>> PerformanceTuningSettings.
>>
>
> +1, and the default could be "no GC", so it would be
> perfectly backwards compatible.  I'm thinking of the
> kinds of scenarios that I often work with, where
> basically all the annotations are later written to
> an index, and any attempt at GC would be futile and
> just consume time to no benefit.
>
>
>>  -Adam
>>
>
>
>
>

Re: small memory footprint tradeoff configuration

Posted by Adam Lally <al...@alum.rpi.edu>.

On Fri, Mar 13, 2009 at 3:07 AM, Thilo Goetz <tw...@gmx.de> wrote:
> Marshall Schor wrote:
>> I agree with both of these concepts:  only GC'ing things which are not
>> in the index and also not reachable from something that is in the index,
>> and making GC'ing (mostly) automatic, based on thresholds, etc, when a
>> component exits back to the framework.  This would be fine for now - if
>> use cases come up where some more programmatic control of this is
>> needed, we could add something.
>>
>> Maybe the next thing to focus on is the "contract" re: GC running.  For
>> a component (primitive or aggregate), the proposed contract is to have
>> the GC not change the FS "id"s that existed prior to the component
>> running.  This is a tradeoff - for more stability with existing handle
>> uses, versus less "aggressive" GC's.
>
> The way I understood it, that was implied by Adam's proposal.
> If you only GC FSs that were created since the component was
> entered, there is no danger of changing the IDs.  Wherever
> you actually do GC, you will have to change the FS IDs.  GC
> without compaction is going to be useless.
>
> The FSs that were created by the app, for example, would never
> be garbage collected, so those references would stay valid until
> the end.
>

It's certainly possible, but more complicated, to have smarter
"handles" that get updated when the underlying FS gets moved in the
heap.  Of course then the CAS would have to have a record of all the
outstanding handles, which it does not now have.  But anyway, with my
proposal this complication would not be necessary.
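
For comparison, a toy sketch of what such "smarter" handles would need:
a registry of every outstanding handle that a compacting GC can walk and
update. This is not how the CAS works today. HandleRegistry, Handle and
relocate are hypothetical names, and addresses are plain ints standing in
for heap positions.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy model of "smart" handles; not how the UIMA CAS works today. */
final class HandleRegistry {
    /** A handle whose heap address can be rewritten by the registry. */
    static final class Handle {
        private int address;

        private Handle(int address) {
            this.address = address;
        }

        int address() {
            return address;
        }
    }

    // Weak references, so a handle the application dropped does not stay registered.
    private final List<WeakReference<Handle>> outstanding = new ArrayList<>();

    Handle newHandle(int address) {
        Handle h = new Handle(address);
        outstanding.add(new WeakReference<>(h));
        return h;
    }

    /**
     * Called by a compacting GC for each FS it moves. Moves are assumed to be
     * reported in ascending address order, as a sliding compaction would do.
     */
    void relocate(int oldAddress, int newAddress) {
        for (Iterator<WeakReference<Handle>> it = outstanding.iterator(); it.hasNext();) {
            Handle h = it.next().get();
            if (h == null) {
                it.remove(); // the handle object itself was collected by the JVM
            } else if (h.address == oldAddress) {
                h.address = newAddress;
            }
        }
    }
}
```

Every move costs a scan of all outstanding handles in this naive form,
which is part of the complication the simpler proposal avoids.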

  -Adam

Re: small memory footprint tradeoff configuration

Posted by Marshall Schor <ms...@schor.com>.


Thilo Goetz wrote:
> Marshall Schor wrote:
>   
>> I agree with both of these concepts:  only GC'ing things which are not
>> in the index and also not reachable from something that is in the index,
>> and making GC'ing (mostly) automatic, based on thresholds, etc, when a
>> component exits back to the framework.  This would be fine for now - if
>> use cases come up where some more programmatic control of this is
>> needed, we could add something.
>>
>> Maybe the next thing to focus on is the "contract" re: GC running.  For
>> a component (primitive or aggregate), the proposed contract is to have
>> the GC not change the FS "id"s that existed prior to the component
>> running.  This is a tradeoff - for more stability with existing handle
>> uses, versus less "aggressive" GC's.
>>     
>
> The way I understood it, that was implied by Adam's proposal.
> If you only GC FSs that were created since the component was
> entered, there is no danger of changing the IDs.  Wherever
> you actually do GC, you will have to change the FS IDs.  GC
> without compaction is going to be useless.
>   
+1 I missed that deduction ;-)
> The FSs that were created by the app, for example, would never
> be garbage collected, so those references would stay valid until
> the end.
>   
true.  I don't think this would be a significant issue.

This topic seems to have come to a pretty good consensus:

  1) Control GC overall by some configuration parameter - default to
     operate without it.
  2) Have it occur after some component (primitive or aggregate) per some
     (new?) spec to enable it.
  3) Only have it affect things added by that component - preserving IDs
     that surrounding components may be holding on to.
  4) Have it occur only if there is some likely gain from spending the time
     to do it (if this can be figured out without much time overhead).

Please correct if this isn't your understanding of the approximate
consensus :-) .  -Marshall
> --Thilo
>
>   
>> -Marshall
>>     
>
>
>
>

Re: small memory footprint tradeoff configuration

Posted by Thilo Goetz <tw...@gmx.de>.

Marshall Schor wrote:
> I agree with both of these concepts:  only GC'ing things which are not
> in the index and also not reachable from something that is in the index,
> and making GC'ing (mostly) automatic, based on thresholds, etc, when a
> component exits back to the framework.  This would be fine for now - if
> use cases come up where some more programmatic control of this is
> needed, we could add something.
> 
> Maybe the next thing to focus on is the "contract" re: GC running.  For
> a component (primitive or aggregate), the proposed contract is to have
> the GC not change the FS "id"s that existed prior to the component
> running.  This is a tradeoff - for more stability with existing handle
> uses, versus less "aggressive" GC's.

The way I understood it, that was implied by Adam's proposal.
If you only GC FSs that were created since the component was
entered, there is no danger of changing the IDs.  Wherever
you actually do GC, you will have to change the FS IDs.  GC
without compaction is going to be useless.

The FSs that were created by the app, for example, would never
be garbage collected, so those references would stay valid until
the end.
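
To make the ID argument concrete, here is a toy sliding compaction that
never touches anything below a high-water mark. It is a sketch under
simplifying assumptions: each FS occupies one slot, the slot index stands
in for the FS id, and ScopedCompactor and forwardingTable are hypothetical
names rather than UIMA APIs.

```java
/** Toy sliding compaction over FS slots; the slot index stands in for the FS "id". */
final class ScopedCompactor {
    /**
     * Returns newId[oldId], or -1 for a collected FS. Slots below
     * highWaterMark are never moved or reclaimed, so ids handed out
     * before the component was entered stay valid.
     */
    static int[] forwardingTable(boolean[] garbage, int highWaterMark) {
        int[] newId = new int[garbage.length];
        int next = highWaterMark; // first slot that may be reused
        for (int old = 0; old < garbage.length; old++) {
            if (old < highWaterMark) {
                newId[old] = old;     // pre-existing FS: id preserved
            } else if (garbage[old]) {
                newId[old] = -1;      // reclaimed
            } else {
                newId[old] = next++;  // survivor slides down, id changes
            }
        }
        return newId;
    }
}
```

Only FSs created inside the component can get a new id, and any handle
to them held by that component would have to be treated as invalid after
the component exits.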

--Thilo

> 
> -Marshall

Re: small memory footprint tradeoff configuration

Posted by Marshall Schor <ms...@schor.com>.

I agree with both of these concepts:  only GC'ing things which are not
in the index and also not reachable from something that is in the index,
and making GC'ing (mostly) automatic, based on thresholds, etc, when a
component exits back to the framework.  This would be fine for now - if
use cases come up where some more programmatic control of this is
needed, we could add something.

Maybe the next thing to focus on is the "contract" re: GC running.  For
a component (primitive or aggregate), the proposed contract is to have
the GC not change the FS "id"s that existed prior to the component
running.  This is a tradeoff - for more stability with existing handle
uses, versus less "aggressive" GC's.

-Marshall

Thilo Goetz wrote:
> Adam Lally wrote:
>   
>> On Wed, Mar 11, 2009 at 8:53 AM, Marshall Schor <ms...@schor.com> wrote:
>>     
>>> I agree in general about not making things more complicated at least to
>>> the user.  I can imagine education working for
>>>  1) things like string interning
>>>  2) things like deleting features from type systems where they're not
>>> being used, and where the annotator producing them will respect this.
>>>
>>> What this approach seems to miss are the following kinds of things:
>>>
>>> 1) cases where some set of annotators produce feature structures, which,
>>> after some point, are no longer needed, and are "deleted" but
>>> never-the-less continue to consume space.
>>>
>>> 2) cases where some set of annotators produce feature structures having
>>> lots of fields, where, after some point, the fields are no longer needed.
>>>
>>> If these are not significant use-cases in practice, then I'm happy to
>>> think-about / work-on other things :-).
>>>
>>>       
>> I'd like to propose discussing the different ideas here one at a time.
>>  We had enough trouble coming to any agreement on GC the last time
>> that we discussed it, without also throwing string interning and
>> feature deleting into the mix.
>>
>> So focusing on GC first (unless you think one of the others is more important):
>>
>> My inclination is to assure that GC deletes only garbage, and that
>> there's no possibility that anything GC'ed could have been referenced
>> by anybody.  The other proposals that don't have this guarantee are
>> scary to me.
>>
>> A way to accomplish this guarantee would be that when the process
>> method of an AnalysisEngine (could be either primitive or aggregate)
>> completes, we can mark as garbage any FS's that were created since the
>> beginning of that process method, but which are not referenced
>> directly or indirectly from anything in the indexes.  Does this
>> concept seem reasonable?
>>     
>
> +1. I like the idea because it is sort of local on the one
> hand, but still allows one to delete FSs from indexes
> later in the processing and have them garbage collected
> (on exiting the containing aggregate).
>
>   
>> The next question is under what conditions would a GC execute.
>> Requiring an explicit call seems counter to what other garbage
>> collecting runtime environments do, and like Thilo I'm confused about
>> who would call this and when.  I think it would be better to define
>> the parameters that control GC in the PerformanceTuningSettings that
>> we already have, and make them dependent on how much CAS heap space is
>> used relative to a GC threshold that the user has set in the
>> PerformanceTuningSettings.
>>     
>
> +1, and the default could be "no GC", so it would be
> perfectly backwards compatible.  I'm thinking of the
> kinds of scenarios that I often work with, where
> basically all the annotations are later written to
> an index, and any attempt at GC would be futile and
> just consume time to no benefit.
>
>   
>>  -Adam
>>     
>
>
>
>

Re: small memory footprint tradeoff configuration

Posted by Thilo Goetz <tw...@gmx.de>.

Adam Lally wrote:
> On Wed, Mar 11, 2009 at 8:53 AM, Marshall Schor <ms...@schor.com> wrote:
>> I agree in general about not making things more complicated at least to
>> the user.  I can imagine education working for
>>  1) things like string interning
>>  2) things like deleting features from type systems where they're not
>> being used, and where the annotator producing them will respect this.
>>
>> What this approach seems to miss are the following kinds of things:
>>
>> 1) cases where some set of annotators produce feature structures, which,
>> after some point, are no longer needed, and are "deleted" but
>> never-the-less continue to consume space.
>>
>> 2) cases where some set of annotators produce feature structures having
>> lots of fields, where, after some point, the fields are no longer needed.
>>
>> If these are not significant use-cases in practice, then I'm happy to
>> think-about / work-on other things :-).
>>
> 
> 
> I'd like to propose discussing the different ideas here one at a time.
>  We had enough trouble coming to any agreement on GC the last time
> that we discussed it, without also throwing string interning and
> feature deleting into the mix.
> 
> So focusing on GC first (unless you think one of the others is more important):
> 
> My inclination is to assure that GC deletes only garbage, and that
> there's no possibility that anything GC'ed could have been referenced
> by anybody.  The other proposals that don't have this guarantee are
> scary to me.
> 
> A way to accomplish this guarantee would be that when the process
> method of an AnalysisEngine (could be either primitive or aggregate)
> completes, we can mark as garbage any FS's that were created since the
> beginning of that process method, but which are not referenced
> directly or indirectly from anything in the indexes.  Does this
> concept seem reasonable?

+1. I like the idea because it is sort of local on the one
hand, but still allows one to delete FSs from indexes
later in the processing and have them garbage collected
(on exiting the containing aggregate).

> 
> The next question is under what conditions would a GC execute.
> Requiring an explicit call seems counter to what other garbage
> collecting runtime environments do, and like Thilo I'm confused about
> who would call this and when.  I think it would be better to define
> the parameters that control GC in the PerformanceTuningSettings that
> we already have, and make them dependent on how much CAS heap space is
> used relative to a GC threshold that the user has set in the
> PerformanceTuningSettings.

+1, and the default could be "no GC", so it would be
perfectly backwards compatible.  I'm thinking of the
kinds of scenarios that I often work with, where
basically all the annotations are later written to
an index, and any attempt at GC would be futile and
just consume time to no benefit.

> 
>  -Adam

Re: small memory footprint tradeoff configuration

Posted by Eddie Epstein <ea...@gmail.com>.

On Fri, Mar 13, 2009 at 10:03 AM, Adam Lally <al...@alum.rpi.edu> wrote:
>>
>> One user scenario that motivated this thread was that an aggregate designer
>> knew exactly when they wanted to do GC. What is wrong with giving them
>> a CAS method to call?
>>
>
> Where would they call this method from?  An aggregate designer would
> just write a descriptor, not code.  I suppose they could stick in a
> "GC annotator" that does nothing but call the GC method, but this
> seems like a hack.

OK. Another problem with a GC call is that FS handles within that
component could be broken, so better not to allow that possibility.

> Can this be handled with configuration?  I think somebody suggested
> something like this already, but I can't find the quote right now.

The quote is from DJ, I think.

> One of the GC configuration options would be to specify that a GC
> should always occur after a particular delegate AE has finished (or
> that GC should occur after that AE finished, if the heap size is also
> above a certain threshold).

Tong put a proposal in
http://cwiki.apache.org/UIMA/improving-uima-debug-capabilities.html
which includes an "operational descriptor" to control runtime behaviors.
Maybe that would be useful for controlling GC.

Eddie

Re: small memory footprint tradeoff configuration

Posted by Adam Lally <al...@alum.rpi.edu>.

On Thu, Mar 12, 2009 at 3:49 PM, Eddie Epstein <ea...@gmail.com> wrote:
> On Thu, Mar 12, 2009 at 12:14 PM, Adam Lally <al...@alum.rpi.edu> wrote:
>> The next question is under what conditions would a GC execute.
>> Requiring an explicit call seems counter to what other garbage
>> collecting runtime environments do, and like Thilo I'm confused about
>> who would call this and when.  I think it would be better to define
>> the parameters that control GC in the PerformanceTuningSettings that
>> we already have, and make them dependent on how much CAS heap space is
>> used relative to a GC threshold that the user has set in the
>> PerformanceTuningSettings.
>
> Given the current CAS implementation, GC related operations are going to
> be fairly expensive, for example, even just computing how much data is
> available to be deleted. So I'd be concerned that any automatic GC based
> on CAS size or other dynamic properties may often be CPU costly.
>
> One user scenario that motivated this thread was that an aggregate designer
> knew exactly when they wanted to do GC. What is wrong with giving them
> a CAS method to call?
>

Where would they call this method from?  An aggregate designer would
just write a descriptor, not code.  I suppose they could stick in a
"GC annotator" that does nothing but call the GC method, but this
seems like a hack.

Can this be handled with configuration?  I think somebody suggested
something like this already, but I can't find the quote right now.
One of the GC configuration options would be to specify that a GC
should always occur after a particular delegate AE has finished (or
that GC should occur after that AE finished, if the heap size is also
above a certain threshold).

  -Adam

Re: small memory footprint tradeoff configuration

Posted by Eddie Epstein <ea...@gmail.com>.

On Thu, Mar 12, 2009 at 12:14 PM, Adam Lally <al...@alum.rpi.edu> wrote:
> The next question is under what conditions would a GC execute.
> Requiring an explicit call seems counter to what other garbage
> collecting runtime environments do, and like Thilo I'm confused about
> who would call this and when.  I think it would be better to define
> the parameters that control GC in the PerformanceTuningSettings that
> we already have, and make them dependent on how much CAS heap space is
> used relative to a GC threshold that the user has set in the
> PerformanceTuningSettings.

Given the current CAS implementation, GC related operations are going to
be fairly expensive, for example, even just computing how much data is
available to be deleted. So I'd be concerned that any automatic GC based
on CAS size or other dynamic properties may often be CPU costly.

One user scenario that motivated this thread was that an aggregate designer
knew exactly when they wanted to do GC. What is wrong with giving them
a CAS method to call?

Re: small memory footprint tradeoff configuration

Posted by Adam Lally <al...@alum.rpi.edu>.

On Wed, Mar 11, 2009 at 8:53 AM, Marshall Schor <ms...@schor.com> wrote:
> I agree in general about not making things more complicated at least to
> the user.  I can imagine education working for
>  1) things like string interning
>  2) things like deleting features from type systems where they're not
> being used, and where the annotator producing them will respect this.
>
> What this approach seems to miss are the following kinds of things:
>
> 1) cases where some set of annotators produce feature structures, which,
> after some point, are no longer needed, and are "deleted" but
> never-the-less continue to consume space.
>
> 2) cases where some set of annotators produce feature structures having
> lots of fields, where, after some point, the fields are no longer needed.
>
> If these are not significant use-cases in practice, then I'm happy to
> think-about / work-on other things :-).
>

I'd like to propose discussing the different ideas here one at a time.
 We had enough trouble coming to any agreement on GC the last time
that we discussed it, without also throwing string interning and
feature deleting into the mix.

So focusing on GC first (unless you think one of the others is more important):

My inclination is to assure that GC deletes only garbage, and that
there's no possibility that anything GC'ed could have been referenced
by anybody.  The other proposals that don't have this guarantee are
scary to me.

A way to accomplish this guarantee would be that when the process
method of an AnalysisEngine (could be either primitive or aggregate)
completes, we can mark as garbage any FS's that were created since the
beginning of that process method, but which are not referenced
directly or indirectly from anything in the indexes.  Does this
concept seem reasonable?
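
A minimal sketch of that marking step, on a toy model rather than the
real CAS layout: FSs are numbered in creation order, refs[i] lists the FSs
that FS i points to, and highWaterMark is the number of FSs that existed
when the process method started. ScopedMark and findGarbage are
hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;

/** Toy model: FS i refers to the FSs listed in refs[i]; not the real CAS layout. */
final class ScopedMark {
    /**
     * Returns the FSs created at or after highWaterMark that are not
     * reachable, directly or indirectly, from any indexed FS.
     */
    static BitSet findGarbage(int[][] refs, int[] indexed, int highWaterMark) {
        BitSet reachable = new BitSet(refs.length);
        Deque<Integer> pending = new ArrayDeque<>();
        for (int root : indexed) {
            if (!reachable.get(root)) {
                reachable.set(root);
                pending.push(root);
            }
        }
        // Follow references transitively from the indexed FSs.
        while (!pending.isEmpty()) {
            for (int target : refs[pending.pop()]) {
                if (!reachable.get(target)) {
                    reachable.set(target);
                    pending.push(target);
                }
            }
        }
        BitSet garbage = new BitSet(refs.length);
        garbage.set(highWaterMark, refs.length); // candidates: created by this component
        garbage.andNot(reachable);               // keep anything the indexes can reach
        return garbage;
    }
}
```

An unindexed FS created before the mark is left alone here, which matches
the guarantee that nothing an outer component might still reference is
collected.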

The next question is under what conditions would a GC execute.
Requiring an explicit call seems counter to what other garbage
collecting runtime environments do, and like Thilo I'm confused about
who would call this and when.  I think it would be better to define
the parameters that control GC in the PerformanceTuningSettings that
we already have, and make them dependent on how much CAS heap space is
used relative to a GC threshold that the user has set in the
PerformanceTuningSettings.

 -Adam

Re: small memory footprint tradeoff configuration

Posted by Marshall Schor <ms...@schor.com>.


Thilo Goetz wrote:
> Marshall Schor wrote:
> [...]
>   
>> I agree that backward compatibility is important and is an issue.  To
>> help the transition to this new scheme, I think an overall global switch
>> is needed (similar to the switches we have for JCas "interning") that
>> would by default make things work the way they do now.  A user
>> interested in small-footprint operation (and in trading off some
>> additional processing cycles to achieve it) would enable this switch.
>>
>> To help it "work" - we would allow things to continue to operate which
>> "set" a non-stored feature - the sets would just become no-ops.  Then if
>> the annotator wasn't paying attention to ResultSpecification, and tried
>> to set features that were not used, it would still work. 
>>
>> On the other end, if an annotator actually made use of a particular
>> feature, but didn't specify it in its "input capability specification",
>> that would fail with this scheme.  The failure would be some kind of
>> Java exception, which would probably be noticed.  To recover, a user of
>> such a component would modify the input capability specification to
>> indicate that that feature was needed. 
>>     
>
> If a feature is defined in the type system, it should be there
> for the annotator writer to use.  Who are we to know how people
> will use those features?
>
>   
>> As I write this, I notice that the input capability specification for a
>> primitive annotator doesn't quite fit the meaning here - because I think
>> it means that this annotator needs that feature upon input - and this
>> edge case - where the annotator itself produces this feature, and then
>> also uses it - is not part of that definition. We could either expand
>> the meaning here to include this edge case, or (possibly a better
>> option) introduce, explicitly, another piece of metadata indicating that
>> a particular type/field was both created and used by this one primitive
>> annotator.  A third option could be to store these "unused" features if
>> set (in some out-of-line temporary storage) for the duration of the
>> running of a particular annotator, just in case these were "used" by the
>> same annotator, and then discard that extra storage after the annotator
>> exits.  This would be a big (but temporary) storage hit, though, so I
>> don't think I would want to do this.
>>     
>
> I vote we don't make things even more complicated than they
> already are, and educate those people who need a performance
> boost.
>   
I agree in general about not making things more complicated at least to
the user.  I can imagine education working for
  1) things like string interning
  2) things like deleting features from type systems where they're not
being used, and where the annotator producing them will respect this.

What this approach seems to miss are the following kinds of things:

1) cases where some set of annotators produce feature structures, which,
after some point, are no longer needed, and are "deleted" but
never-the-less continue to consume space.

2) cases where some set of annotators produce feature structures having
lots of fields, where, after some point, the fields are no longer needed.

If these are not significant use-cases in practice, then I'm happy to
think-about / work-on other things :-).

-Marshall
> --Thilo
>
>
>
>

Re: small memory footprint tradeoff configuration

Posted by Thilo Goetz <tw...@gmx.de>.

Marshall Schor wrote:
[...]
> I agree that backward compatibility is important and is an issue.  To
> help the transition to this new scheme, I think an overall global switch
> is needed (similar to the switches we have for JCas "interning") that
> would by default make things work the way they do now.  A user
> interested in small-footprint operation (and in trading off some
> additional processing cycles to achieve it) would enable this switch.
> 
> To help it "work" - we would allow things to continue to operate which
> "set" a non-stored feature - the sets would just become no-ops.  Then if
> the annotator wasn't paying attention to ResultSpecification, and tried
> to set features that were not used, it would still work. 
> 
> On the other end, if an annotator actually made use of a particular
> feature, but didn't specify it in its "input capability specification",
> that would fail with this scheme.  The failure would be some kind of
> Java exception, which would probably be noticed.  To recover, a user of
> such a component would modify the input capability specification to
> indicate that that feature was needed. 

If a feature is defined in the type system, it should be there
for the annotator writer to use.  Who are we to know how people
will use those features?

> 
> As I write this, I notice that the input capability specification for a
> primitive annotator doesn't quite fit the meaning here - because I think
> it means that this annotator needs that feature upon input - and this
> edge case - where the annotator itself produces this feature, and then
> also uses it - is not part of that definition. We could either expand
> the meaning here to include this edge case, or (possibly a better
> option) introduce, explicitly, another piece of metadata indicating that
> a particular type/field was both created and used by this one primitive
> annotator.  A third option could be to store these "unused" features if
> set (in some out-of-line temporary storage) for the duration of the
> running of a particular annotator, just in case these were "used" by the
> same annotator, and then discard that extra storage after the annotator
> exits.  This would be a big (but temporary) storage hit, though, so I
> don't think I would want to do this.

I vote we don't make things even more complicated than they
already are, and educate those people who need a performance
boost.

--Thilo

Re: small memory footprint tradeoff configuration

Posted by Marshall Schor <ms...@schor.com>.

Thanks for your comments.

Thilo Goetz wrote:
> Marshall Schor wrote:
>   
>> After reviewing the previous chain of discussion on this topic, I would
>> like to start the next round, hopefully getting to some convergence :-).
>>
>> 1) On the topic of doing GC (garbage collection) versus copy to another
>> CAS - GC is conceptually perhaps less complex - you don't have multiple
>> CASes around.  I note that Adam (and maybe others) preferred this.
>>
>> 2) Keeping IDs: within the context of an aggregate, doing a GC and
>> having the resulting FeatureStructure IDs change is an issue.  Some
>> suggestions to lessen this:
>>   a) No automatic GC, just do when requested.
>>   b) If done in an Aggregate (could be an inner one), guarantee that IDs
>> that existed upon entry to the aggregate would be preserved.  This could
>> be done using the high-water-mark mechanism that was put in for
>> delta-cas.  This seems to have some nice properties concerning what a
>> user of an aggregate has to know about the aggregate's inner behavior -
>> in particular, the user would not need to know about that aggregate's
>> call to GC, since the outer aggregate's "handles" to feature structures
>> would not move due to the GC.
>>     
>
> I'm a little foggy on the concept of explicit GC calls.  Where would
> the API reside, and who would be allowed to call it, and when?
>   
One possibility is to not have it be callable, but to have an annotator
or aggregate marked so that it would be called by the framework after a
CAS exited that component.  I haven't figured out other details here,
though.  Details might include (a) an overall enabling of this, (b) some
indication that there's enough to reclaim in the CAS to make it
worthwhile.  Worthwhile may be hard to define, though, so some kind of
explicit indication to do it might be needed.  The indication might be
to set some context flag saying to do the GC when you exit the
annotator.  Here's a use case: testing shows your application is taking
31 MB of storage, but the app in which you're embedded has a hard limit
of 30 MB... 
>   
>> Part of this concept could be to allow an option to have the GC to move
>> everything; maybe a different explicit call.
>>
>> 3) There is a potential to trade performance for space using String
>> "interning" - to ensure string values set for features are stored just
>> once.  This is typically done using a hashmap of some kind, so there's
>> overhead for that, so it may or may not actually reduce space.  There is
>> also a potential to store Strings using UTF-8 encoding - which may or
>> may not save space (depends on the string, etc.)
>>     
>
> Any Java programmer who does a lot of string handling should know
> when to intern strings, when to use constants, and when to create
> new Strings.  
Well, my experience is quite different.  Many Java programmers around
here were unfamiliar with interning.  But I do basically agree that some
(or most) of this benefit can happen via annotator writers.  Perhaps we
need to document this in some new section (e.g. on how to write small
footprint annotators).
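As a starting point for such documentation, here is a minimal sketch of
what an annotator writer could do on their own.  SharedStringPool is a
hypothetical helper, not an existing UIMA class; it uses a plain HashMap
so the pool can be dropped when the annotator is done, and that map is
itself the overhead mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of UIMA): coalesces equal strings to one shared object. */
final class SharedStringPool {
    // Maps each distinct value to its canonical instance.
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the canonical instance for the given value; null stays null. */
    String share(String value) {
        if (value == null) {
            return null;
        }
        String existing = pool.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    /** Number of distinct strings currently held by the pool. */
    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        SharedStringPool pool = new SharedStringPool();
        // new String(...) forces two distinct objects with equal content.
        String a = pool.share(new String("NN"));
        String b = pool.share(new String("NN"));
        System.out.println("same object: " + (a == b));
        System.out.println("distinct values: " + pool.size());
    }
}
```

An annotator would pass a value through share() before setting it on a
feature.  Whether this saves space depends on how often values repeat.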

> Who do you think you're going to be helping with this?
> What's the use case?
>
>> For String interning, we could have two different kinds of approaches to
>> specifying its use: a global, application-wide setting or a specific
>> setting (e.g., add a new basic type to UIMA, called, for instance,
>> cas.uima.SharedString). Using a new type would allow users to pick just
>> the cases where they wanted the extra machinery to coalesce equal
>> strings to one shared object.
>>
>> 4) There is a potentially big space reduction possible by being able to
>> mark some fields of feature structures as never being "read".  Such
>> fields could then be not stored.  For instance, a feature structure of
>> type TOKEN might have many fields, representing various information -
>> only some of which might be used in a particular application.  Even if
>> the ResultSpecification for a tokenizer is set to indicate not to
>> "produce" these fields, today, space is consumed for those fields (they
>> are filled with "null" or 0).  If there are many instances of this
>> feature structure type (such as Token), this can be a significant space
>> saver.  To identify a field as never being read, one could look at the
>> aggregate's component's capability specification - and mark the field if
>> it doesn't appear in any of the delegate's input specs.  For an
>> outermost aggregate, one would probably want to add that aggregate's
>> output specification - to capture the outer application's potential use
>> of fields.
>>     
>
> We need to be careful here not to destroy backward compatibility.
> Result specs are optional in fixed flows, and many annotators (that
> I use) don't use them.
>
> What we usually do in cases like this is that we modify the type
> system.  The annotator (in this case, the tokenizer) checks the
> type system on startup.  Presence/absence of features triggers/
> inhibits certain processing.  This may not be an ideal solution,
> but it works because it requires the cooperation of the annotator
> writer.
>
> If you don't have access to the annotator's source code, you'll
> never know if it can really work without those features in all
> cases.  If you do have the source code, you can make it work
> with the scheme above.
>   
I agree that backward compatibility is important and is an issue.  To
help the transition to this new scheme, I think an overall global switch
is needed (similar to the switches we have for JCas "interning") that
would by default make things work the way they do now.  A user
interested in small-footprint operation (and in trading off some
additional processing cycles to achieve it) would enable this switch.

To help it "work" - we would allow operations which "set" a non-stored
feature to continue - these sets would just become no-ops.  Then if
the annotator wasn't paying attention to ResultSpecification, and tried
to set features that were not used, it would still work. 

On the other end, if an annotator actually made use of a particular
feature, but didn't specify it in its "input capability specification",
that would fail with this scheme.  The failure would be some kind of
Java exception, which would probably be noticed.  To recover, a user of
such a component would modify the input capability specification to
indicate that that feature was needed. 
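A minimal sketch of the behavior described in the last two paragraphs,
assuming a hypothetical SparseFeatureStore class (nothing like it exists
in UIMA today) and using plain feature names in place of real type system
objects: sets of non-stored features are silently dropped, and reads of
them fail loudly.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: one feature structure whose never-read features are not stored. */
final class SparseFeatureStore {
    private final Set<String> storedFeatures;
    private final Map<String, Object> values = new HashMap<>();

    SparseFeatureStore(Set<String> storedFeatures) {
        this.storedFeatures = new HashSet<>(storedFeatures);
    }

    /** Setting a non-stored feature is a no-op, so annotators that ignore the result spec still run. */
    void set(String feature, Object value) {
        if (storedFeatures.contains(feature)) {
            values.put(feature, value);
        }
    }

    /** Reading a non-stored feature fails, signalling a missing input capability declaration. */
    Object get(String feature) {
        if (!storedFeatures.contains(feature)) {
            throw new IllegalStateException(
                "Feature '" + feature + "' is not stored; declare it as an input capability");
        }
        return values.get(feature);
    }

    public static void main(String[] args) {
        Set<String> stored = new HashSet<>();
        stored.add("pos");
        SparseFeatureStore token = new SparseFeatureStore(stored);
        token.set("pos", "NN");
        token.set("stem", "run");   // dropped silently
        System.out.println(token.get("pos"));
        try {
            token.get("stem");
        } catch (IllegalStateException e) {
            System.out.println("failed as intended: " + e.getMessage());
        }
    }
}
```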

As I write this, I notice that the input capability specification for a
primitive annotator doesn't quite fit the meaning here - because I think
it means that this annotator needs that feature upon input - and this
edge case - where the annotator itself produces this feature, and then
also uses it - is not part of that definition. We could either expand
the meaning here to include this edge case, or (possibly a better
option) introduce, explicitly, another piece of metadata indicating that
a particular type/field was both created and used by this one primitive
annotator.  A third option could be to store these "unused" features if
set (in some out-of-line temporary storage) for the duration of the
running of a particular annotator, just in case these were "used" by the
same annotator, and then discard that extra storage after the annotator
exits.  This would be a big (but temporary) storage hit, though, so I
don't think I would want to do this.

-Marshall
>   
>> There is another layer of space granularity we could consider. This
>> would be to let the assembler divide the components into "groups"
>> (perhaps just using the natural grouping aggregates provide), and
>> compute (or specify) which fields of types were "not used" by group. For
>> instance, consider group1 which has lots of extra fields in Token, which
>> are used during group1 components, followed by group2 processing, which
>> doesn't use most of the fields in Token.  A GC operation could then
>> "compress" the representation of Token (of those where preserving the
>> FeatureStructure ID wasn't required). 
>>
>> To avoid creation of new collections of components, we could restrict
>> the "groups" to be just what a particular aggregate contained.  Using
>> this approach, we could envision using the input/output capability
>> specifications to automatically deduce which extra fields could be
>> eliminated.  We could also have an automatic GC mode - which would
>> invoke the GC (that didn't alter pre-existing feature structures) at the
>> end of all aggregates.  Although this might do too many gc's, it would
>> be conceptually simple.
>>
>> -Marshall
>>     
>
>
>

Re: small memory footprint tradeoff configuration

Posted by Thilo Goetz <tw...@gmx.de>.

Marshall Schor wrote:
> After reviewing the previous chain of discussion on this topic, I would
> like to start the next round, hopefully getting to some convergence :-).
> 
> 1) On the topic of doing GC (garbage collection) versus copy to another
> CAS - GC is conceptually perhaps less complex - you don't have multiple
> CASes around.  I note that Adam (and maybe others) preferred this.
> 
> 2) Keeping IDs: within the context of an aggregate, doing a GC and
> having the resulting FeatureStructure IDs change is an issue.  Some
> suggestions to lessen this:
>   a) No automatic GC, just do when requested.
>   b) If done in an Aggregate (could be an inner one), guarantee that IDs
> that existed upon entry to the aggregate would be preserved.  This could
> be done using the high-water-mark mechanism that was put in for
> delta-cas.  This seems to have some nice properties concerning what a
> user of an aggregate has to know about the aggregate's inner behavior -
> in particular, the user would not need to know about that aggregate's
> call to GC, since the outer aggregate's "handles" to feature structures
> would not move due to the GC.

I'm a little foggy on the concept of explicit GC calls.  Where would
the API reside, and who would be allowed to call it, and when?

> 
> Part of this concept could be to allow an option to have the GC to move
> everything; maybe a different explicit call.
> 
> 3) There is a potential to trade performance for space using String
> "interning" - to ensure string values set for features are stored just
> once.  This is typically done using a hashmap of some kind, so there's
> overhead for that, so it may or may not actually reduce space.  There is
> also a potential to store Strings using UTF-8 encoding - which may or
> may not save space (depends on the string, etc.)

Any Java programmer who does a lot of string handling should know
when to intern strings, when to use constants, and when to create
new Strings.  Who do you think you're going to be helping with this?
What's the use case?

> 
> For String interning, we could have two different kinds of approaches to
> specifying its use: a global, application-wide setting or a specific
> setting (e.g., add a new basic type to UIMA, called, for instance,
> cas.uima.SharedString). Using a new type would allow users to pick just
> the cases where they wanted the extra machinery to coalesce equal
> strings to one shared object.
> 
> 4) There is a potentially big space reduction possible by being able to
> mark some fields of feature structures as never being "read".  Such
> fields could then be not stored.  For instance, a feature structure of
> type TOKEN might have many fields, representing various information -
> only some of which might be used in a particular application.  Even if
> the ResultSpecification for a tokenizer is set to indicate not to
> "produce" these fields, today, space is consumed for those fields (they
> are filled with "null" or 0).  If there are many instances of this
> feature structure type (such as Token), this can be a significant space
> saver.  To identify a field as never being read, one could look at the
> aggregate's component's capability specification - and mark the field if
> it doesn't appear in any of the delegate's input specs.  For an
> outermost aggregate, one would probably want to add that aggregate's
> output specification - to capture the outer application's potential use
> of fields.

We need to be careful here not to destroy backward compatibility.
Result specs are optional in fixed flows, and many annotators (that
I use) don't use them.

What we usually do in cases like this is that we modify the type
system.  The annotator (in this case, the tokenizer) checks the
type system on startup.  Presence/absence of features triggers/
inhibits certain processing.  This may not be an ideal solution,
but it works because it requires the cooperation of the annotator
writer.

If you don't have access to the annotator's source code, you'll
never know if it can really work without those features in all
cases.  If you do have the source code, you can make it work
with the scheme above.

> 
> There is another layer of space granularity we could consider. This
> would be to let the assembler divide the components into "groups"
> (perhaps just using the natural grouping aggregates provide), and
> compute (or specify) which fields of types were "not used" by group. For
> instance, consider group1 which has lots of extra fields in Token, which
> are used during group1 components, followed by group2 processing, which
> doesn't use most of the fields in Token.  A GC operation could then
> "compress" the representation of Token (of those where preserving the
> FeatureStructure ID wasn't required). 
> 
> To avoid creation of new collections of components, we could restrict
> the "groups" to be just what a particular aggregate contained.  Using
> this approach, we could envision using the input/output capability
> specifications to automatically deduce which extra fields could be
> eliminated.  We could also have an automatic GC mode - which would
> invoke the GC (that didn't alter pre-existing feature structures) at the
> end of all aggregates.  Although this might do too many gc's, it would
> be conceptually simple.
> 
> -Marshall

Re: small memory footprint tradeoff configuration

Posted by Marshall Schor <ms...@schor.com>.

After reviewing the previous chain of discussion on this topic, I would
like to start the next round, hopefully getting to some convergence :-).

1) On the topic of doing GC (garbage collection) versus copy to another
CAS - GC is conceptually perhaps less complex - you don't have multiple
CASes around.  I note that Adam (and maybe others) preferred this.

2) Keeping IDs: within the context of an aggregate, doing a GC and
having the resulting FeatureStructure IDs change is an issue.  Some
suggestions to lessen this:
  a) No automatic GC, just do when requested.
  b) If done in an Aggregate (could be an inner one), guarantee that IDs
that existed upon entry to the aggregate would be preserved.  This could
be done using the high-water-mark mechanism that was put in for
delta-cas.  This seems to have some nice properties concerning what a
user of an aggregate has to know about the aggregate's inner behavior -
in particular, the user would not need to know about that aggregate's
call to GC, since the outer aggregate's "handles" to feature structures
would not move due to the GC.

Part of this concept could be to allow an option to have the GC to move
everything; maybe a different explicit call.

3) There is a potential to trade performance for space using String
"interning" - to ensure string values set for features are stored just
once.  This is typically done using a hashmap of some kind, so there's
overhead for that, so it may or may not actually reduce space.  There is
also a potential to store Strings using UTF-8 encoding - which may or
may not save space (depends on the string, etc.)

For String interning, we could have two different kinds of approaches to
specifying its use: a global, application-wide setting or a specific
setting (e.g., add a new basic type to UIMA, called, for instance,
cas.uima.SharedString). Using a new type would allow users to pick just
the cases where they wanted the extra machinery to coalesce equal
strings to one shared object.

4) There is a potentially big space reduction possible by being able to
mark some fields of feature structures as never being "read".  Such
fields could then be not stored.  For instance, a feature structure of
type TOKEN might have many fields, representing various information -
only some of which might be used in a particular application.  Even if
the ResultSpecification for a tokenizer is set to indicate not to
"produce" these fields, today, space is consumed for those fields (they
are filled with "null" or 0).  If there are many instances of this
feature structure type (such as Token), this can be a significant space
saver.  To identify a field as never being read, one could look at the
aggregate's component's capability specification - and mark the field if
it doesn't appear in any of the delegate's input specs.  For an
outermost aggregate, one would probably want to add that aggregate's
output specification - to capture the outer application's potential use
of fields.
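The "never read" computation in item 4 can be sketched as simple set
arithmetic.  This is only an illustration: UnusedFeatureFinder is a
hypothetical name, and plain sets of feature-name strings stand in for
the real capability specifications.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: sets of feature names stand in for capability specifications. */
final class UnusedFeatureFinder {

    /**
     * A feature counts as read if it appears in any delegate's inputs or in the
     * outermost aggregate's outputs; everything else declared is never read.
     */
    static Set<String> neverRead(Set<String> declared,
                                 List<Set<String>> delegateInputs,
                                 Set<String> outerOutputs) {
        Set<String> read = new HashSet<String>(outerOutputs);
        for (Set<String> inputs : delegateInputs) {
            read.addAll(inputs);
        }
        Set<String> unused = new TreeSet<String>(declared);
        unused.removeAll(read);
        return unused;
    }

    public static void main(String[] args) {
        Set<String> declared = new HashSet<String>(
            Arrays.asList("Token:pos", "Token:lemma", "Token:stem"));
        Set<String> parserInputs = new HashSet<String>(Arrays.asList("Token:pos"));
        Set<String> tokenizerInputs = new HashSet<String>();
        List<Set<String>> delegateInputs = Arrays.asList(tokenizerInputs, parserInputs);
        Set<String> outerOutputs = new HashSet<String>(Arrays.asList("Token:lemma"));

        // Candidates for not being stored at all.
        System.out.println(neverRead(declared, delegateInputs, outerOutputs));
    }
}
```

Note that this inherits the weakness of the approach: a feature read by
an annotator that does not declare it as input would wrongly show up as
unused.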

There is another layer of space granularity we could consider. This
would be to let the assembler divide the components into "groups"
(perhaps just using the natural grouping aggregates provide), and
compute (or specify) which fields of types were "not used" by group. For
instance, consider group1 which has lots of extra fields in Token, which
are used during group1 components, followed by group2 processing, which
doesn't use most of the fields in Token.  A GC operation could then
"compress" the representation of Token (of those where preserving the
FeatureStructure ID wasn't required). 

To avoid creation of new collections of components, we could restrict
the "groups" to be just what a particular aggregate contained.  Using
this approach, we could envision using the input/output capability
specifications to automatically deduce which extra fields could be
eliminated.  We could also have an automatic GC mode - which would
invoke the GC (that didn't alter pre-existing feature structures) at the
end of all aggregates.  Although this might do too many gc's, it would
be conceptually simple.

-Marshall

Re: small memory footprint tradeoff configuration

Posted by Marshall Schor <ms...@schor.com>.

Another way to reduce the footprint of UIMA:

One user reported the basic UIMA framework as taking approx. 5 MB (not
sure exactly what was measured).  I investigated to see if UIMA might be
loading more classes than needed.  I found that at startup time, UIMA
reads a factory configuration file and assigns classes to interfaces,
storing these in a hashmap. 

The factory configuration (located in
uimaj-core/src/main/resources/org.apache.uima.impl/factoryConfig.xml)
has specs for things like the collection processing manager. 

The startup code does a Class.forName on these to load them (and confirm
they are present).   This makes Java "lazy loading" not work so well,
since many of these won't be used.  I did a heapdump of a tiny UIMA
application using the
uimaj-examples/src/main/java/org.apache.uima.examples/ExampleApplication.java
- reading a simple descriptor and running it, and found many classes
pertaining to the CPE (Collection Processing) which my test application
doesn't use. 

I see two possible approaches to improving this: one is having users who
are memory sensitive learn more about the factory configuration file,
and have them remove parts of it that are for things they won't be
using.  I don't much like this approach - it's error prone, especially
over time...

The other approach is to modify the way the factory configuration does
its resolution to make it lazy - for instance, changing it so that only
on first reference to an interface would the corresponding class be
loaded.  This has a potential issue where the failure to find a
particular needed implementation in the class path might happen later in
a run, rather than at the start, but I don't think that's a serious
drawback, compared to the potential footprint reduction.
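To make the lazy approach concrete, here is a rough sketch.  It is not
the actual UIMA factory code; LazyFactoryRegistry and its methods are
hypothetical names.  Registration only records class names, and
Class.forName runs on first lookup, which is also where a missing
implementation would now surface.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of lazy interface-to-implementation resolution. */
final class LazyFactoryRegistry {
    private final Map<String, String> implNames = new HashMap<>();
    private final Map<String, Class<?>> loaded = new HashMap<>();

    /** Records the mapping only; no class is loaded here. */
    synchronized void register(String interfaceName, String implClassName) {
        implNames.put(interfaceName, implClassName);
    }

    /** Loads the implementation class on first reference and caches it. */
    synchronized Class<?> resolve(String interfaceName) {
        Class<?> cached = loaded.get(interfaceName);
        if (cached != null) {
            return cached;
        }
        String implName = implNames.get(interfaceName);
        if (implName == null) {
            throw new IllegalArgumentException("No implementation registered for " + interfaceName);
        }
        try {
            Class<?> impl = Class.forName(implName);
            loaded.put(interfaceName, impl);
            return impl;
        } catch (ClassNotFoundException e) {
            // With lazy loading this failure shows up at first use, not at startup.
            throw new IllegalStateException("Implementation not on class path: " + implName, e);
        }
    }

    synchronized int loadedCount() {
        return loaded.size();
    }

    public static void main(String[] args) {
        LazyFactoryRegistry registry = new LazyFactoryRegistry();
        // JDK classes are used here only so the example runs anywhere.
        registry.register("java.util.List", "java.util.ArrayList");
        registry.register("java.util.Map", "java.util.HashMap");
        System.out.println("loaded after registration: " + registry.loadedCount());
        System.out.println("resolved: " + registry.resolve("java.util.List").getName());
        System.out.println("loaded after first use: " + registry.loadedCount());
    }
}
```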

What do others think?

-Marshall

Re: small memory footprint tradeoff configuration

Posted by Marshall Schor <ms...@schor.com>.

Thilo Goetz wrote:
> Marshall Schor wrote:
>   
>> Thilo Goetz wrote:
>>     
>>> Marshall Schor wrote:
>>>   
>>>       
>>>> One of the ideas for GC was to change the basic heap design to use java
>>>> objects for feature structures.  I'm thinking of some kind of explicit
>>>> GC, called by the user, at a point where they know a bunch of objects is
>>>> no longer needed (because they've just deleted things out of the index,
>>>> for instance).  The use case is one where some set of annotators might
>>>> generate many "alternatives", and then a subsequent annotator "picks"
>>>> one, and removes the others from the index.
>>>>
>>>> I'm thinking that the implementation might be based on the deep CAS copy
>>>> code we already have, modified in an attempt to avoid needing extra space. 
>>>>
>>>> I think this would avoid many of the other issues mentioned in the
>>>> previous thread http://markmail.org/thread/aolbz4nrvmgjhuyb.  If there
>>>> are issues/concerns with this kind of approach, please post/discuss.
>>>>     
>>>>         
>>> It would change the internal IDs of FSs, which was always a
>>> big no-no for some people.
>>>   
>>>       
>> True.  For things like delta-cas, or parallel processing flows in
>> UIMA-AS, which use a high-water mark of some kind, I'm thinking (hoping)
>> we could make this work.
>>
>> For other cases, I don't know how much an issue this is.  In any case,
>> by having it be not-automatic, but rather user-invoked via some explicit
>> call (e.g., myCas.reclaim-space()), I'm hoping (again) that only users
>> who were not needing the internal IDs the same would call this. 
>>     
>
> So the users are supposed to figure out if they need internal
> IDs?  I don't think that's a good idea.  Either we make guarantees
> about things like references into the CAS surviving calls to
> process(), or we don't.
>   
What I'm thinking here, is that references into the CAS that use
low-level "handles" would not be "correct" after the GC call.  This
would affect people using low-level references.

For people who use normal interfaces and obtained Java cover objects
(either JCas ones, or non-JCas ones), these would also be invalid.  The
JCas ones, however, could be fixed up, if the JCas Cache is enabled.  If
it's not, then those references become invalid, too. 

Any suggestions on how to handle these?  Perhaps one suggestion might be
to give up on complete GC saving, and instead, leave everything where
it is, but create a free-cell chain and reuse the space.  This has
problems with "fragmentation", of course, so probably isn't that good an
idea...
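For reference, a free-cell chain could look roughly like the sketch
below.  FreeListHeap is a hypothetical class, and it assumes fixed-size
cells, which is a simplification: with fixed-size cells fragmentation
does not arise, whereas real feature structures vary in size, and that is
exactly where the fragmentation problem comes from.

```java
/** Hypothetical sketch: fixed-size cells, with freed cells chained for reuse. */
final class FreeListHeap {
    private final int[] nextFree;   // for a free cell: index of the next free cell, or -1
    private final boolean[] inUse;
    private int freeHead = -1;      // head of the free-cell chain
    private int top = 0;            // first cell that has never been allocated

    FreeListHeap(int capacity) {
        nextFree = new int[capacity];
        inUse = new boolean[capacity];
    }

    /** Reuses a freed cell if one exists; otherwise takes a fresh cell from the top. */
    int allocate() {
        int id;
        if (freeHead != -1) {
            id = freeHead;
            freeHead = nextFree[id];
        } else {
            if (top == nextFree.length) {
                throw new IllegalStateException("heap full");
            }
            id = top++;
        }
        inUse[id] = true;
        return id;
    }

    /** Returns a cell to the chain; IDs of all other cells stay unchanged. */
    void free(int id) {
        if (id < 0 || id >= top || !inUse[id]) {
            throw new IllegalArgumentException("cell " + id + " is not allocated");
        }
        inUse[id] = false;
        nextFree[id] = freeHead;
        freeHead = id;
    }

    public static void main(String[] args) {
        FreeListHeap heap = new FreeListHeap(8);
        int a = heap.allocate();
        int b = heap.allocate();
        int c = heap.allocate();
        heap.free(b);
        int d = heap.allocate();   // reuses the cell that b occupied
        System.out.println(a + " " + b + " " + c + " " + d);
    }
}
```

The attraction is that nothing moves, so existing handles stay valid.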

Another idea is to not do GC at all, but instead, make it really easy to
logically split a pipeline (at the GC point) and do a quick copy into a
new CAS, followed by releasing the previous CAS (and ensuring its space
was reclaimed, perhaps even incrementally, so that not so much extra
space would be needed while doing the copy).   Exposing this as an
explicit step ensures a level of clarity - the previous CAS is "reset",
so that users would know that no references into it could be valid.

One other idea that I was reminded of - some annotators may have many
"fields" within the feature structure they "could" populate, but at
various points, the extra fields may no longer be necessary.  Having a
special copy step that allows subsets of the fields in the new CAS could
also reclaim this space.  Essentially, the type systems would differ by
having some of the fields missing, and that would be allowed (perhaps
under some specification, to reduce mistakes).

-Marshall
>   
>> This was helpful - please post refs to other areas to look into before
>> proceeding.
>>
>> -Marshall
>>     
>>>   
>>>       
>>>> -Marshall
>>>>
>>>> Thilo Goetz wrote:
>>>>     
>>>>         
>>>>> Marshall Schor wrote:
>>>>>   
>>>>>       
>>>>>           
>>>>>> Some users are beginning to ask for the ability to shift the internal
>>>>>> tradeoffs UIMA takes toward having a smaller memory footprint, at some
>>>>>> cost in performance.
>>>>>>
>>>>>> Several areas in particular have come up: 
>>>>>>   1) "interning" string objects, so that only one copy exists
>>>>>>   2) having some way to "compact" or garbage-collect the CAS
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> My suggestions for garbage collection in the CAS met with strong
>>>>> resistance on this list in the past.  I'll be interested to see
>>>>> what you'll come up with to overcome that resistance.
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>>>> Are there other things that should be considered for trade-off here?
>>>>>>
>>>>>> -Marshall
>>>>>>     
>>>>>>         
>>>>>>             
>>>>>   
>>>>>       
>>>>>           
>>>   
>>>       
>
>
>

Re: small memory footprint tradeoff configuration

Posted by Adam Lally <al...@alum.rpi.edu>.

On Wed, Feb 25, 2009 at 9:07 AM, Eddie Epstein <ea...@gmail.com> wrote:
>> It seems like Marshall's angle (if I understood it) is not really GC
>> at all, but a model where an annotator decides to explicitly delete
>> FS.  I could be okay with that idea, too.  A GC model by definition
>> should preserve any referenced FSs, but if we say we have an explicit
>> deletion model where anybody can delete anyone else's stuff, at least
>> we won't confuse people about what's going on.  Current applications
>> that use existing annotators would not break (because the annotators
>> would not delete anything), and if a new annotator is introduced that
>> breaks the application, it's the annotator's fault for being too
>> aggressive in deleting stuff that someone else might still need.
>
> What is different here? UIMA already lets any annotator delete anything
> in the CAS, where deletion is defined by removing FS from the index
> and removing references. GC would just add the ability to reclaim FS
> heap space.
>

By "explicitly delete" I meant something akin to a
non-garbage-collecting language like C.  The user's code has to say
explicitly what FS to delete, and if those happen to be still
referenced by something, that's their problem.  I thought that might
have been what Marshall was suggesting.  I thought he was suggesting
an explicit operation that an annotator would need to call in addition
to making an FS unreferencable.  But maybe I misunderstood.  Anyway, I
prefer a GC option rather than this.

  -Adam

Re: small memory footprint tradeoff configuration

Posted by Eddie Epstein <ea...@gmail.com>.

On Tue, Feb 24, 2009 at 9:36 AM, Adam Lally <al...@alum.rpi.edu> wrote:
> To address Eddie's point about Vinci services breaking FS handles
> already - I consider that a bug, so am not happy using that as a
> rationale to invalidate FS handles as a general policy.  And I'm
> worried that users who haven't been using Vinci services (I bet we
> have plenty of those) have built applications that rely on this
> behavior.  I remember suggesting that we post on the user list about
> this, but am not sure if we ever did.
>
> If you do a GC approach, is there not any way to include
> application-created FeatureStructures as part of the "root" set?  Or
> to look at it another way, the set of FS's that you do the GC over is
> only those created since the CAS was input to the current AE (possibly
> an aggregate).

This seems like a really good approach, allowing a set of annotators
in an aggregate to create temporary content and then fully remove
it. Bhavani has already created a mechanism for setting marks in the
FS heap at AE entry points. Based on these marks it should be easy
to do as you suggest: limit GC to those FS created since entering the
current aggregate. A client application would be able to preserve its
references by either running primitive AE in an aggregate, or by manually
setting the mark itself.
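Here is a toy sketch of that idea.  It does not use the actual mark
mechanism; MarkedHeap is a hypothetical class in which an ID is simply a
list index, the caller supplies the set of live IDs, and references
between feature structures are not modeled.  Cells below the mark never
move, so handles obtained before entry stay valid.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: compaction restricted to cells created after a mark. */
final class MarkedHeap {
    private final List<String> cells = new ArrayList<>();
    private int mark = 0;

    /** Adds a feature structure (a string stands in for it) and returns its ID. */
    int add(String fs) {
        cells.add(fs);
        return cells.size() - 1;
    }

    /** Called on entry to an aggregate: everything already present is protected. */
    void setMark() {
        mark = cells.size();
    }

    String get(int id) {
        return cells.get(id);
    }

    int size() {
        return cells.size();
    }

    /**
     * Drops cells at or above the mark that are not in liveIds and slides the
     * survivors down. Returns old ID -> new ID for the surviving cells.
     */
    Map<Integer, Integer> compactAboveMark(Set<Integer> liveIds) {
        Map<Integer, Integer> relocated = new HashMap<>();
        int next = mark;
        for (int old = mark; old < cells.size(); old++) {
            if (liveIds.contains(old)) {
                cells.set(next, cells.get(old));   // next <= old, so nothing live is overwritten
                relocated.put(old, next);
                next++;
            }
        }
        cells.subList(next, cells.size()).clear();
        return relocated;
    }

    public static void main(String[] args) {
        MarkedHeap heap = new MarkedHeap();
        int doc = heap.add("DocumentAnnotation");
        heap.setMark();                       // entering the aggregate
        heap.add("alternative-1");
        int chosen = heap.add("alternative-2");
        heap.add("alternative-3");
        Map<Integer, Integer> moved = heap.compactAboveMark(java.util.Collections.singleton(chosen));
        System.out.println("pre-existing ID " + doc + " still holds " + heap.get(doc));
        System.out.println("relocations: " + moved + ", heap size: " + heap.size());
    }
}
```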

> It seems like Marshall's angle (if I understood it) is not really GC
> at all, but a model where an annotator decides to explicitly delete
> FS.  I could be okay with that idea, too.  A GC model by definition
> should preserve any referenced FSs, but if we say we have an explicit
> deletion model where anybody can delete anyone else's stuff, at least
> we won't confuse people about what's going on.  Current applications
> that use existing annotators would not break (because the annotators
> would not delete anything), and if a new annotator is introduced that
> breaks the application, it's the annotator's fault for being too
> aggressive in deleting stuff that someone else might still need.

What is different here? UIMA already lets any annotator delete anything
in the CAS, where deletion is defined by removing FS from the index
and removing references. GC would just add the ability to reclaim FS
heap space.

Eddie

Re: small memory footprint tradeoff configuration

Posted by Adam Lally <al...@alum.rpi.edu>.

On Tue, Feb 24, 2009 at 2:53 AM, Thilo Goetz <tw...@gmx.de> wrote:
> I have found the discussion again that I was referring to.  It wasn't
> on this list, it was in the OASIS spec discussions.  Sorry about the
> confusion.  I don't feel at liberty to publish that conversation here,
> but maybe Adam would like to comment?  He and I were debating this at
> the time (nearly two years ago).
>

I'm not sure about what OASIS discussion you mean (is it about xmi:id
consistency?), but I thought the link that Marshall posted was a
reasonable summary of the discussion, including the concerns that I
had:
http://markmail.org/thread/aolbz4nrvmgjhuyb.

The only sticking point I was really concerned about was the
invalidation of the FS handle held by an application.  But, it was
definitely not my intention to shoot down any work in this area (in
fact you'll see in that email thread where I explicitly said I'm in
favor of doing something in this space).  I just want to discuss it
and see if we can come to a mutually acceptable plan.

To address Eddie's point about Vinci services breaking FS handles
already - I consider that a bug, so am not happy using that as a
rationale to invalidate FS handles as a general policy.  And I'm
worried that users who haven't been using Vinci services (I bet we
have plenty of those) have built applications that rely on this
behavior.  I remember suggesting that we post on the user list about
this, but am not sure if we ever did.

If you do a GC approach, is there not any way to include
application-created FeatureStructures as part of the "root" set?  Or
to look at it another way, the set of FS's that you do the GC over is
only those created since the CAS was input to the current AE (possibly
an aggregate).

It seems like Marshall's angle (if I understood it) is not really GC
at all, but a model where an annotator decides to explicitly delete
FS.  I could be okay with that idea, too.  A GC model by definition
should preserve any referenced FSs, but if we say we have an explicit
deletion model where anybody can delete anyone else's stuff, at least
we won't confuse people about what's going on.  Current applications
that use existing annotators would not break (because the annotators
would not delete anything), and if a new annotator is introduced that
breaks the application, it's the annotator's fault for being too
aggressive in deleting stuff that someone else might still need.

-Adam

Re: small memory footprint tradeoff configuration

Posted by Eddie Epstein <ea...@gmail.com>.

> Eddie Epstein wrote:
>> Process calls to a Vinci service have always broken FS references.
>> Same for calls thru the compatibility wrapper that allows calling
>> colocated UIMA 1.4x annotators from Apache UIMA.

Actually, I think that the compatibility wrapper does preserve FS
addresses because it uses binary serialization. But Vinci definitely
does not.

Eddie

Re: small memory footprint tradeoff configuration

Posted by Thilo Goetz <tw...@gmx.de>.

Eddie Epstein wrote:
> On Mon, Feb 23, 2009 at 12:10 PM, Thilo Goetz <tw...@gmx.de> wrote:
>> So the users are supposed to figure out if they need internal
>> IDs?  I don't think that's a good idea.  Either we make guarantees
>> about things like references into the CAS surviving calls to
>> process(), or we don't.
> 
> Process calls to a Vinci service have always broken FS references.
> Same for calls thru the compatibility wrapper that allows calling
> colocated UIMA 1.4x annotators from Apache UIMA. So we have not in the
> past made such guarantees for remote or even all colocated components.
> 
> Supporting GC called in a service will require some work. If the
> client uses XMI serialization and supports delta CAS replies, nothing
> would change and the GC would be a noop as far as the client is
> concerned. With the entire CAS coming back, some changes would be
> needed to compensate for deletion of pre-existing FS, but otherwise FS
> references would still be good. A binary serialization reply after GC
> would definitely invalidate references.
> 
> Eddie

That's all perfectly fine with me.

I have found the discussion again that I was referring to.  It wasn't
on this list, it was in the OASIS spec discussions.  Sorry about the
confusion.  I don't feel at liberty to publish that conversation here,
but maybe Adam would like to comment?  He and I were debating this at
the time (nearly two years ago).

--Thilo

Re: small memory footprint tradeoff configuration

Posted by Eddie Epstein <ea...@gmail.com>.

On Mon, Feb 23, 2009 at 12:10 PM, Thilo Goetz <tw...@gmx.de> wrote:
> So the users are supposed to figure out if they need internal
> IDs?  I don't think that's a good idea.  Either we make guarantees
> about things like references into the CAS surviving calls to
> process(), or we don't.

Process calls to a Vinci service have always broken FS references.
Same for calls through the compatibility wrapper that allows calling
colocated UIMA 1.4x annotators from Apache UIMA. So we have not in the
past made such guarantees for remote or even all colocated components.

Supporting GC called in a service will require some work. If the
client uses XMI serialization and supports delta CAS replies, nothing
would change and the GC would be a no-op as far as the client is
concerned. With the entire CAS coming back, some changes would be
needed to compensate for deletion of pre-existing FSs, but otherwise FS
references would still be good. A binary serialization reply after GC
would definitely invalidate references.
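
The contrast between the cases can be sketched with a toy table, assuming
the client refers to FSs by an ID that is independent of heap position and
that the GC reports an old-address to new-address map.  StableRefTable and
its methods are hypothetical, not UIMA classes.  References that go through
such a table survive a compaction, while raw addresses held by the client
do not.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

/** Toy table from position-independent IDs to heap addresses; not a UIMA API. */
final class StableRefTable {
    private final Map<String, Integer> addressOf = new HashMap<>();

    /** Records that the FS known to the client as stableId lives at address. */
    void register(String stableId, int address) {
        addressOf.put(stableId, address);
    }

    /**
     * Applies the old-address to new-address map produced by a compaction.
     * Entries whose FS was reclaimed (no mapping) are dropped.
     */
    void remap(Map<Integer, Integer> forward) {
        addressOf.replaceAll((id, old) -> forward.getOrDefault(old, -1));
        addressOf.values().removeIf(address -> address < 0);
    }

    /** Returns the current address, or empty if the FS no longer exists. */
    OptionalInt resolve(String stableId) {
        Integer address = addressOf.get(stableId);
        return address == null ? OptionalInt.empty() : OptionalInt.of(address);
    }
}
```

A client that kept only the raw address would have no way to notice that
the FS moved or was deleted, which is the situation a binary reply after GC
would create.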

Eddie

Re: small memory footprint tradeoff configuration

Posted by Thilo Goetz <tw...@gmx.de>.

Marshall Schor wrote:
> Thilo Goetz wrote:
>> Marshall Schor wrote:
>>   
>>> One of the ideas for GC was to change the basic heap design to use java
>>> objects for feature structures.  I'm thinking of some kind of explicit
>>> GC, called by the user, at a point where they know a bunch of objects is
>>> no longer needed (because they've just deleted things out of the index,
>>> for instance).  The use case is one where some set of annotators might
>>> generate many "alternatives", and then a subsequent annotator "picks"
>>> one, and removes the others from the index.
>>>
>>> I'm thinking that the implementation might be based on the deep CAS copy
>>> code we already have, modified in an attempt to avoid needing extra space. 
>>>
>>> I think this would avoid many of the other issues mentioned in the
>>> previous thread http://markmail.org/thread/aolbz4nrvmgjhuyb.  If there
>>> are issues/concerns with this kind of approach, please post/discuss.
>>>     
>> It would change the internal IDs of FSs, which was always a
>> big no-no for some people.
>>   
> True.  For things like delta-cas, or parallel processing flows in
> UIMA-AS, which use a high-water mark of some kind, I'm thinking (hoping)
> we could make this work.
> 
> For other cases, I don't know how much an issue this is.  In any case,
> by having it be not-automatic, but rather user-invoked via some explicit
> call (e.g., myCas.reclaim-space()), I'm hoping (again) that only users
> who were not needing the internal IDs the same would call this. 

So the users are supposed to figure out if they need internal
IDs?  I don't think that's a good idea.  Either we make guarantees
about things like references into the CAS surviving calls to
process(), or we don't.

> 
> This was helpful - please post refs to other areas to look into before
> proceeding.
> 
> -Marshall
>>   
>>> -Marshall
>>>
>>> Thilo Goetz wrote:
>>>     
>>>> Marshall Schor wrote:
>>>>   
>>>>       
>>>>> Some users are beginning to ask for the ability to shift the internal
>>>>> tradeoffs UIMA takes toward having a smaller memory footprint, at some
>>>>> cost in performance.
>>>>>
>>>>> Several areas in particular have come up: 
>>>>>   1) "interning" string objects, so that only one copy exists
>>>>>   2) having some way to "compact" or garbage-collect the CAS
>>>>>     
>>>>>         
>>>> My suggestions for garbage collection in the CAS met with strong
>>>> resistance on this list in the past.  I'll be interested to see
>>>> what you'll come up with to overcome that resistance.
>>>>
>>>>   
>>>>       
>>>>> Are there other things that should be considered for trade-off here?
>>>>>
>>>>> -Marshall

Re: small memory footprint tradeoff configuration

Posted by Marshall Schor <ms...@schor.com>.

Thilo Goetz wrote:
> Marshall Schor wrote:
>   
>> One of the ideas for GC was to change the basic heap design to use java
>> objects for feature structures.  I'm thinking of some kind of explicit
>> GC, called by the user, at a point where they know a bunch of objects is
>> no longer needed (because they've just deleted things out of the index,
>> for instance).  The use case is one where some set of annotators might
>> generate many "alternatives", and then a subsequent annotator "picks"
>> one, and removes the others from the index.
>>
>> I'm thinking that the implementation might be based on the deep CAS copy
>> code we already have, modified in an attempt to avoid needing extra space. 
>>
>> I think this would avoid many of the other issues mentioned in the
>> previous thread http://markmail.org/thread/aolbz4nrvmgjhuyb.  If there
>> are issues/concerns with this kind of approach, please post/discuss.
>>     
>
> It would change the internal IDs of FSs, which was always a
> big no-no for some people.
>   
True.  For things like delta-cas, or parallel processing flows in
UIMA-AS, which use a high-water mark of some kind, I'm thinking (hoping)
we could make this work.

For other cases, I don't know how much of an issue this is.  In any case,
by having it be not automatic, but rather user-invoked via some explicit
call (e.g., myCas.reclaimSpace()), I'm hoping (again) that only users
who do not need the internal IDs to stay the same would call this.
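
One way the high-water mark case might be made to work, sketched under the
assumption that the compaction is order-preserving (survivors keep their
relative order): the mark can then be recomputed as a count of the
survivors that sat below the old mark.  HighWaterMark and
markAfterCompaction are hypothetical names, and this is only an
illustration, not how delta-cas is implemented.

```java
import java.util.BitSet;

/** Toy calculation only; assumes a sliding, order-preserving compaction. */
final class HighWaterMark {
    /**
     * live.get(i) is true if the FS at old position i survives the compaction.
     * Every survivor below the old mark still precedes every survivor at or
     * above it, so the new mark is the number of survivors below the old mark.
     */
    static int markAfterCompaction(BitSet live, int oldMark) {
        return live.get(0, oldMark).cardinality();
    }
}
```

A compaction that copies in some other order, for example starting from the
index, would not have this property and would need a different scheme.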

This was helpful - please post refs to other areas to look into before
proceeding.

-Marshall
>   
>> -Marshall
>>
>> Thilo Goetz wrote:
>>     
>>> Marshall Schor wrote:
>>>   
>>>       
>>>> Some users are beginning to ask for the ability to shift the internal
>>>> tradeoffs UIMA takes toward having a smaller memory footprint, at some
>>>> cost in performance.
>>>>
>>>> Several areas in particular have come up: 
>>>>   1) "interning" string objects, so that only one copy exists
>>>>   2) having some way to "compact" or garbage-collect the CAS
>>>>     
>>>>         
>>> My suggestions for garbage collection in the CAS met with strong
>>> resistance on this list in the past.  I'll be interested to see
>>> what you'll come up with to overcome that resistance.
>>>
>>>   
>>>       
>>>> Are there other things that should be considered for trade-off here?
>>>>
>>>> -Marshall

Re: small memory footprint tradeoff configuration

Posted by Thilo Goetz <tw...@gmx.de>.

Eddie Epstein wrote:
> On Fri, Feb 20, 2009 at 7:23 AM, Thilo Goetz <tw...@gmx.de> wrote:
>> It would change the internal IDs of FSs, which was always a
>> big no-no for some people.
> 
> True, ID's would change, but this would be documented behavior, and
> there should be no problem if an annotator called GC just before
> returning from process(). By the way, is the discussion about changing
> IDs a big no-no documented anywhere?

Other than the archives of this list?  Probably not.

> 
> Eddie

Re: small memory footprint tradeoff configuration

Posted by Eddie Epstein <ea...@gmail.com>.

On Fri, Feb 20, 2009 at 7:23 AM, Thilo Goetz <tw...@gmx.de> wrote:
> It would change the internal IDs of FSs, which was always a
> big no-no for some people.

True, IDs would change, but this would be documented behavior, and
there should be no problem if an annotator called GC just before
returning from process(). By the way, is the discussion about changing
IDs being a big no-no documented anywhere?

Eddie

Re: small memory footprint tradeoff configuration

Posted by Thilo Goetz <tw...@gmx.de>.

Marshall Schor wrote:
> One of the ideas for GC was to change the basic heap design to use java
> objects for feature structures.  I'm thinking of some kind of explicit
> GC, called by the user, at a point where they know a bunch of objects is
> no longer needed (because they've just deleted things out of the index,
> for instance).  The use case is one where some set of annotators might
> generate many "alternatives", and then a subsequent annotator "picks"
> one, and removes the others from the index.
> 
> I'm thinking that the implementation might be based on the deep CAS copy
> code we already have, modified in an attempt to avoid needing extra space. 
> 
> I think this would avoid many of the other issues mentioned in the
> previous thread http://markmail.org/thread/aolbz4nrvmgjhuyb.  If there
> are issues/concerns with this kind of approach, please post/discuss.

It would change the internal IDs of FSs, which was always a
big no-no for some people.

> 
> -Marshall
> 
> Thilo Goetz wrote:
>> Marshall Schor wrote:
>>   
>>> Some users are beginning to ask for the ability to shift the internal
>>> tradeoffs UIMA takes toward having a smaller memory footprint, at some
>>> cost in performance.
>>>
>>> Several areas in particular have come up: 
>>>   1) "interning" string objects, so that only one copy exists
>>>   2) having some way to "compact" or garbage-collect the CAS
>>>     
>> My suggestions for garbage collection in the CAS met with strong
>> resistance on this list in the past.  I'll be interested to see
>> what you'll come up with to overcome that resistance.
>>
>>   
>>> Are there other things that should be considered for trade-off here?
>>>
>>> -Marshall

Re: small memory footprint tradeoff configuration

Posted by Marshall Schor <ms...@schor.com>.

One of the ideas for GC was to change the basic heap design to use Java
objects for feature structures.  I'm thinking of some kind of explicit
GC, called by the user, at a point where they know a bunch of objects are
no longer needed (because they've just deleted things out of the index,
for instance).  The use case is one where some set of annotators might
generate many "alternatives", and then a subsequent annotator "picks"
one, and removes the others from the index.

I'm thinking that the implementation might be based on the deep CAS copy
code we already have, modified in an attempt to avoid needing extra space. 

I think this would avoid many of the other issues mentioned in the
previous thread http://markmail.org/thread/aolbz4nrvmgjhuyb.  If there
are issues/concerns with this kind of approach, please post/discuss.
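
As a rough illustration of the copy-based idea, here is a toy sketch that
copies only what is reachable from the indexed FSs into a fresh heap.
ToyHeap and copyReachableInto are hypothetical names, an FS is reduced to
an array of references, and its ID is simply its position.  Unlike the
approach described above, this sketch does use extra space for the second
heap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy heap, not the UIMA CAS: cell i holds the IDs referenced by FS i. */
final class ToyHeap {
    final List<int[]> cells = new ArrayList<>();

    /** Adds an FS that references the given IDs; returns the new FS's ID. */
    int add(int... refs) {
        cells.add(refs.clone());
        return cells.size() - 1;
    }

    /**
     * Copies every FS reachable from the indexed roots into target.
     * Returns the old-ID to new-ID map; unreachable FSs get no entry.
     */
    Map<Integer, Integer> copyReachableInto(ToyHeap target, List<Integer> indexedRoots) {
        Map<Integer, Integer> forward = new HashMap<>();
        for (int root : indexedRoots) {
            copy(root, target, forward);
        }
        return forward;
    }

    private int copy(int oldId, ToyHeap target, Map<Integer, Integer> forward) {
        Integer known = forward.get(oldId);
        if (known != null) {
            return known; // already copied (also terminates cycles)
        }
        int[] oldRefs = cells.get(oldId);
        int[] newRefs = new int[oldRefs.length];
        target.cells.add(newRefs);
        int newId = target.cells.size() - 1;
        forward.put(oldId, newId); // record before recursing
        for (int i = 0; i < oldRefs.length; i++) {
            newRefs[i] = copy(oldRefs[i], target, forward);
        }
        return newId;
    }
}
```

With one token at ID 0 and three alternatives at IDs 1 to 3 that each
reference it, keeping only the alternative at ID 3 in the index yields a
two-cell heap in which that alternative now has ID 0 and the token has
ID 1.  The two discarded alternatives are gone, and every surviving ID has
changed.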

-Marshall

Thilo Goetz wrote:
> Marshall Schor wrote:
>   
>> Some users are beginning to ask for the ability to shift the internal
>> tradeoffs UIMA takes toward having a smaller memory footprint, at some
>> cost in performance.
>>
>> Several areas in particular have come up: 
>>   1) "interning" string objects, so that only one copy exists
>>   2) having some way to "compact" or garbage-collect the CAS
>>     
>
> My suggestions for garbage collection in the CAS met with strong
> resistance on this list in the past.  I'll be interested to see
> what you'll come up with to overcome that resistance.
>
>   
>> Are there other things that should be considered for trade-off here?
>>
>> -Marshall

Re: small memory footprint tradeoff configuration

Posted by Thilo Goetz <tw...@gmx.de>.

Marshall Schor wrote:
> Some users are beginning to ask for the ability to shift the internal
> tradeoffs UIMA takes toward having a smaller memory footprint, at some
> cost in performance.
> 
> Several areas in particular have come up: 
>   1) "interning" string objects, so that only one copy exists
>   2) having some way to "compact" or garbage-collect the CAS

My suggestions for garbage collection in the CAS met with strong
resistance on this list in the past.  I'll be interested to see
what you'll come up with to overcome that resistance.

> 
> Are there other things that should be considered for trade-off here?
> 
> -Marshall