Posted to user@uima.apache.org by Petr Baudis <pa...@ucw.cz> on 2015/07/10 00:52:01 UTC

UIMAj3 ideas

  Hi!

On Thu, Jul 09, 2015 at 03:51:26PM -0400, Marshall Schor wrote:
> The discussion of future directions for UIMA is spread over several pages in the
> wiki, but a good page to start is
> https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3
On Thu, Jul 09, 2015 at 04:17:44PM -0400, Marshall Schor wrote:
> I'll take a look.  This kind of thing is "on the list" for uima v3; see
> https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3

  I didn't figure out how to edit that wiki page, but a mental summary
of the things I currently find irritating about UIMA, and would love to
see changed, has formed in my mind, so I thought I could contribute it
for discussion.

  * UIMAfit is not part of core UIMA and UIMA-AS is not part of core
    UIMA.  It seems to me that UIMA-AS is doing things a bit differently
    than what the original UIMA idea of doing scaleout was.  The two
    things don't play well together.  I'd love a way to easily take
    my plain UIMA pipeline and scale it out, ideally without any code
    changes, *and* avoid the terrible XML config files.

  * Speaking of avoiding the config files, it'd be nice if I could avoid
    them for type systems as well.  A radical idea: In the end, I treat
    UIMA essentially as a storage for Java objects; I suspect many others
    do the same.  I'd love a way to turn JCasGen on its head and write
    the Java classes (possibly with some restrictions) that I could
    store in UIMA, with the backend figuring out the low-level UIMA
    representation on its own.  This would radically reduce some aspects
    of the engineering overhead for me and maybe many other users.

  * The JCas UIMA interface should be more transparent in other ways
    too.  Working with arrays (and absence of lists) is a huge pain.
    I just want to work with feature structures as if they were normal
    Java objects, without major restrictions.

  * Connected with the above - I'd love .addToIndexes() to just
    disappear.  Right now, the paradigm is that you build an annotation
    in an annotator, and the moment it gets saved in a CAS, it becomes
    basically read-only.  But if I want e.g. to build up a set of
    features across multiple annotators, things again become very
    painful.  Because arrays are also fixed-size, I need awful
    boilerplate code like

                AnswerInfo ai = JCasUtil.selectSingle(jcas, AnswerInfo.class);
                AnswerFV fv = new AnswerFV(ai);
                fv.setFeature(f, 1.0);

                for (FeatureStructure af : ai.getFeatures().toArray())
                        ((AnswerFeature) af).removeFromIndexes();
                ai.removeFromIndexes();

                ai.setFeatures(fv.toFSArray(jcas));
                ai.addToIndexes();

    simply to add a feature.  (Note the AnswerFV class, which is the
    actual thing I want to store in a JCas - a dynamic list of
    (feature_label, feature_value) pairs - but to do that it ends
    up being instead a complex factory of JCas FSes with a lot more
    boilerplate code inside.  Also note the typecast.)

  * I wondered about storing (arbitrary) graphs in the CAS, but the
    issues above make this really impractical.  If you also think about
    integrating microformats, you need to think about how to do this.

  * Complex pipelines are a bit clumsy.  I think the biggest obvious
    problem is lack of signalling to CAS merger that input CASes have
    been exhausted.  Having an "isLast" barrier sounds simple as long
    as you have only a single CAS multiplier paired with the CAS merger,
    but when this assumption breaks down, things start to deteriorate.
    However, I realize complex pipelines are a niche area.
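
The AnswerFV pattern from the addToIndexes() item above (a growable set of (feature_label, feature_value) pairs that has to be flattened into a fixed-size array before it can be stored) can be sketched in plain Java. This is only an illustration with no UIMA dependency: FeatureVectorSketch and LabeledValue are hypothetical names, not the real AnswerFV or AnswerFeature classes, and the flattening step stands in for whatever toFSArray() actually does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for one stored (label, value) pair; not a UIMA type.
final class LabeledValue {
    final String label;
    final double value;

    LabeledValue(String label, double value) {
        this.label = label;
        this.value = value;
    }
}

// Hypothetical growable feature vector that is flattened on demand.
final class FeatureVectorSketch {
    // Insertion-ordered so the flattened array has a stable layout.
    private final Map<String, Double> values = new LinkedHashMap<>();

    // Rebuild the growable view from a previously stored fixed-size array.
    static FeatureVectorSketch fromArray(LabeledValue[] stored) {
        FeatureVectorSketch fv = new FeatureVectorSketch();
        for (LabeledValue lv : stored) {
            fv.values.put(lv.label, lv.value);
        }
        return fv;
    }

    // Add a new label or overwrite an existing one.
    void setFeature(String label, double value) {
        values.put(label, value);
    }

    // Flatten into a fresh fixed-size array, as a fixed-length store requires.
    LabeledValue[] toArray() {
        LabeledValue[] out = new LabeledValue[values.size()];
        int i = 0;
        for (Map.Entry<String, Double> e : values.entrySet()) {
            out[i++] = new LabeledValue(e.getKey(), e.getValue());
        }
        return out;
    }
}
```

Every change goes through a full rebuild of the array, which is the overhead the boilerplate above is complaining about.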

  I think these are my main concerns.  I guess another way to phrase it:
I came to UIMA looking for a way to generate, store and organize
my+3rdparty Java object annotations of various text-based entities.
It sort of delivers, but if I did this again, I'd seriously wonder
whether the steep learning curve and incredible engineering overhead are
worth it.  I'd like UIMAj3 to remove that hesitation and get out of
my way! :)

-- 
				Petr Baudis
	If you have good ideas, good data and fast computers,
	you can do almost anything. -- Geoffrey Hinton

Re: UIMAj3 ideas

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 16.07.2015, at 20:37, Petr Baudis <pa...@ucw.cz> wrote:

> On Thu, Jul 16, 2015 at 08:00:35PM +0200, Richard Eckart de Castilho wrote:
>> On 16.07.2015, at 18:52, Petr Baudis <pa...@ucw.cz> wrote:
>>> Sorry for the confusion, but that's not quite what I had in mind.
>>> I literally believe that right now, in order to modify value of
>>> a feature, you need to first remove it from an index, change the
>>> value, then re-add it back.  Is that a misconception?
>> 
>> Well, yes and no. Yes, it was required for the case where the value that
>> you changed was on a feature that was part of some index. No, it should
>> no longer be required as measures have been implemented to handle this
>> automatically.
>> 
>> See: "The curious case of the zombie annotation" aka UIMA-4049
>> 
>> https://issues.apache.org/jira/browse/UIMA-4049
> 
>  That's great to hear!  However, when reading the bug report and
> looking closely at that part of the release notes, I think "it should no
> longer be required" isn't quite precise as changing indexed features
> might cause an exception to be thrown by an iterator that goes through
> these at the same time (so the fix for that is to use a snapshot
> iterator, and that sounds reasonable, more so when JCasUtil gets support
> for them - sorry if it did and I missed it, I'm still stuck on UIMA 2.6
> for now anyway until the next release with fixed CasCopier).

uimaFIT doesn't use them just yet. I might find some time to actually
do a new uimaFIT release before September, but more probably end of 
December or early January. If there are any blocking issues, I'll
of course try to fix them but only in the SNAPSHOT.

Cheers,

-- Richard

Re: UIMAj3 ideas

Posted by Petr Baudis <pa...@ucw.cz>.
On Thu, Jul 16, 2015 at 08:00:35PM +0200, Richard Eckart de Castilho wrote:
> On 16.07.2015, at 18:52, Petr Baudis <pa...@ucw.cz> wrote:
> >  Sorry for the confusion, but that's not quite what I had in mind.
> > I literally believe that right now, in order to modify value of
> > a feature, you need to first remove it from an index, change the
> > value, then re-add it back.  Is that a misconception?
> 
> Well, yes and no. Yes, it was required for the case where the value that
> you changed was on a feature that was part of some index. No, it should
> no longer be required as measures have been implemented to handle this
> automatically.
> 
> See: "The curious case of the zombie annotation" aka UIMA-4049
> 
> https://issues.apache.org/jira/browse/UIMA-4049

  That's great to hear!  However, when reading the bug report and
looking closely at that part of the release notes, I think "it should no
longer be required" isn't quite precise as changing indexed features
might cause an exception to be thrown by an iterator that goes through
these at the same time (so the fix for that is to use a snapshot
iterator, and that sounds reasonable, more so when JCasUtil gets support
for them - sorry if it did and I missed it, I'm still stuck on UIMA 2.6
for now anyway until the next release with fixed CasCopier).

> >  I think that's a bug for the UIMA Tutorial, which mentions FSArray but
> > not FSList.  :-)
> 
> Then I should tell you also about the uimaFIT FSCollectionFactory which
> contains all kinds of helpers to manage FSArray and FSList ;)
> 
> Btw. there is also ArrayFS which is the CAS version of FSArray :P
..
> Did you know that uimaFIT JCasUtil.select() can also be applied to
> FSList and FSArray to avoid casting?
> 
> for (Token t : JCasUtil.select(sentence.getTokens(), Token.class)) {
>   ...
> }
> 
> CasUtil.select() can work also on ArrayFS

  So much great news! Thanks so much for these.  We'll certainly start
using them in new code. :-)

-- 
				Petr Baudis
	If you have good ideas, good data and fast computers,
	you can do almost anything. -- Geoffrey Hinton

Re: UIMAj3 ideas

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 16.07.2015, at 18:52, Petr Baudis <pa...@ucw.cz> wrote:

> On Fri, Jul 10, 2015 at 01:37:27PM -0400, Marshall Schor wrote:
>>> 
>>>  * UIMAfit is not part of core UIMA and UIMA-AS is not part of core
>>>    UIMA.  It seems to me that UIMA-AS is doing things a bit differently
>>>    than what the original UIMA idea of doing scaleout was.  The two
>>>    things don't play well together.  I'd love a way to easily take
>>>    my plain UIMA pipeline and scale it out, ideally without any code
>>>    changes, *and* avoid the terrible XML config files.
>> Any specifics of what to change here would be helpful.  UIMA-AS was designed to
>> enable scale-out without changing the core UIMA pipeline or its XML
>> descriptor.  The additional information for UIMA-AS scaleout was put into a
>> separate xml descriptor which "embeds" the original plain UIMA one.
> 
>  I'm sure Richard would be able to explain this better, but I think one
> of the core issues is that UIMA-AS embeds the XML descriptor instead of
> the AnalysisEngineDescription.  So when I want to use it together with
> AnalysisEngineDescription built with UIMAfit instead, it's time to
> start making crazy workarounds like

Afaik, there is no API in UIMA-AS that allows injecting an AnalysisEngineDescription
into a UIMA-AS descriptor. UIMA-AS forces one to use an import, so the AED
needs to be serialized and then imported again by UIMA-AS... or I just never
found the right method call or missed when it was added. In fact, I didn't
even find an API to programmatically create a UIMA-AS descriptor and at the
time saw myself forced to implement an "AsDeploymentDescription.java" myself.

See: https://code.google.com/p/dkpro-lab/source/browse/de.tudarmstadt.ukp.dkpro.lab/de.tudarmstadt.ukp.dkpro.lab.uima.engine.uimaas/src/main/java/de/tudarmstadt/ukp/dkpro/lab/uima/engine/uimaas/

>>>  * Connected with the above - I'd love .addToIndexes() to just
>>>    disappear.  Right now, the paradigm is that you build an annotation
>>>    in an annotator, and the moment it gets saved in a CAS, it becomes
>>>    basically read-only.  
>> You certainly can modify any of an Annotation's features subsequently.
>> I'm guessing you're referring to another idea - adding additional features that were
>> not initially defined in the UIMA type system.
> 
>  Sorry for the confusion, but that's not quite what I had in mind.
> I literally believe that right now, in order to modify value of
> a feature, you need to first remove it from an index, change the
> value, then re-add it back.  Is that a misconception?

Well, yes and no. Yes, it was required for the case where the value that
you changed was on a feature that was part of some index. No, it should
no longer be required as measures have been implemented to handle this
automatically.

See: "The curious case of the zombie annotation" aka UIMA-4049

https://issues.apache.org/jira/browse/UIMA-4049
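
Why changing an index-key feature in place used to be a problem can be illustrated without UIMA at all. The sketch below uses a java.util.TreeSet as a stand-in for a sorted index. Span and its begin field are hypothetical, and this is an analogy for the behaviour discussed in UIMA-4049, not a description of how UIMA indexes are implemented.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical annotation-like object whose sort key can be mutated.
final class Span {
    int begin;

    Span(int begin) {
        this.begin = begin;
    }
}

final class SortedIndexSketch {
    // A sorted "index" keyed on Span.begin.
    static TreeSet<Span> newIndex() {
        return new TreeSet<>(Comparator.comparingInt((Span s) -> s.begin));
    }

    // Mutating the key while the element is indexed: the lookup follows the
    // new key and misses the element, which is still physically in the set.
    static boolean removeAfterInPlaceChange() {
        TreeSet<Span> index = newIndex();
        Span a = new Span(1);
        index.add(a);
        index.add(new Span(5));
        index.add(new Span(9));
        a.begin = 20;            // key changed behind the index's back
        return index.remove(a);  // false here: the element became a "zombie"
    }

    // The classic protocol: remove, change, then add back.
    static boolean removeAfterSafeChange() {
        TreeSet<Span> index = newIndex();
        Span a = new Span(1);
        index.add(a);
        index.add(new Span(5));
        index.add(new Span(9));
        index.remove(a);         // take it out while the key is still valid
        a.begin = 20;
        index.add(a);            // re-inserted at its new sorted position
        return index.remove(a);  // true: the index can find it again
    }
}
```

Features that are not part of any index key have no such problem, which matches the "yes and no" above.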

>  I think that's a bug for the UIMA Tutorial, which mentions FSArray but
> not FSList.  :-)

Then I should tell you also about the uimaFIT FSCollectionFactory which
contains all kinds of helpers to manage FSArray and FSList ;)

Btw. there is also ArrayFS which is the CAS version of FSArray :P

>  (Another pain point here - I always ache when I need to work with
> FSArray or I guess FSList, since it does not carry the type information
> that is in the typesystem - I need to manually typecast all the time
> and hope I don't make a mistake.)

Did you know that uimaFIT JCasUtil.select() can also be applied to
FSList and FSArray to avoid casting?

for (Token t : JCasUtil.select(sentence.getTokens(), Token.class)) {
  ...
}

CasUtil.select() can work also on ArrayFS
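
The cast-free iteration that select() offers can be approximated in plain Java. The sketch below is only an analogy: TypedViewSketch.select is a hypothetical helper over an Object[] and is not the uimaFIT implementation. Skipping elements of other types, rather than rejecting them, is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class TypedViewSketch {
    // Return the elements of an untyped array that are instances of the
    // requested type, already cast, so callers need no manual typecasts.
    static <T> List<T> select(Object[] untyped, Class<T> type) {
        List<T> result = new ArrayList<>();
        for (Object o : untyped) {
            if (type.isInstance(o)) {
                result.add(type.cast(o));
            }
        }
        return result;
    }
}
```

A caller can then write for (String s : TypedViewSketch.select(items, String.class)) { ... } with no cast in the loop body.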

Cheerio,

-- Richard

Re: UIMAj3 ideas

Posted by Petr Baudis <pa...@ucw.cz>.
On Fri, Jul 10, 2015 at 01:37:27PM -0400, Marshall Schor wrote:
> On 7/9/2015 6:52 PM, Petr Baudis wrote:
> <snip...>
> 
> https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3
> 
> >   I didn't figure out how to edit that wiki page, 
> Due to spammers, we had to turn off public editing.  However, I can add you to a
> list ( to do this, you have to "register" for a user id on the wiki, and then
> send me offline what that Id is ), but even without being on the list, there's a
> comment button which (I think) lets you add comments at the bottom.
> > but a mental summary
> > of the things I find currently irritating about UIMA and would love to
> > see changed formed in my mind, so I thought I could contribute it for
> > discussion.
> Great!
> >
> >   * UIMAfit is not part of core UIMA and UIMA-AS is not part of core
> >     UIMA.  It seems to me that UIMA-AS is doing things a bit differently
> >     than what the original UIMA idea of doing scaleout was.  The two
> >     things don't play well together.  I'd love a way to easily take
> >     my plain UIMA pipeline and scale it out, ideally without any code
> >     changes, *and* avoid the terrible XML config files.
> Any specifics of what to change here would be helpful.  UIMA-AS was designed to
> enable scale-out without changing the core UIMA pipeline or its XML
> descriptor.  The additional information for UIMA-AS scaleout was put into a
> separate xml descriptor which "embeds" the original plain UIMA one.

  I'm sure Richard would be able to explain this better, but I think one
of the core issues is that UIMA-AS embeds the XML descriptor instead of
the AnalysisEngineDescription.  So when I want to use it together with
AnalysisEngineDescription built with UIMAfit instead, it's time to
start making crazy workarounds like

	https://code.google.com/p/dkpro-lab/source/browse/de.tudarmstadt.ukp.dkpro.lab/de.tudarmstadt.ukp.dkpro.lab.uima.engine.uimaas/src/main/java/de/tudarmstadt/ukp/dkpro/lab/uima/engine/uimaas/component/SimpleService.java?name=14aeba50c8c1&r=14aeba50c8c18ea4d14c0d099f43c049f806d9db

> >   * Connected with the above - I'd love .addToIndexes() to just
> >     disappear.  Right now, the paradigm is that you build an annotation
> >     in an annotator, and the moment it gets saved in a CAS, it becomes
> >     basically read-only.  
> You certainly can modify any of an Annotation's features subsequently.
> I'm guessing you're referring to another idea - adding additional features that were
> not initially defined in the UIMA type system.

  Sorry for the confusion, but that's not quite what I had in mind.
I literally believe that right now, in order to modify the value of
a feature, you need to first remove the feature structure from the
indexes, change the value, then add it back.  Is that a misconception?

> UIMA sets up the types and
> features once at the start of the pipeline run (from a merge of all the
> component's type systems), and locks down the type system.  Other frameworks
> sometimes allow an unlocked type system, where you could add (after a Feature
> Structure is created) additional features.  This is usually done by keeping a
> list of feature-name <-> feature-value pairs (such as your code snippet does,
> below).  We're thinking of including this capability in the version 3, with a
> bit of a twist - the intent would be to keep the "compilable" aspect of
> "locked-down" type/features (for high performance), while adding (for those use
> cases that want it) the other style of dynamically added additional features (at
> some cost in performance).  

  Still, this would be awesome and I'd totally make use of it!

  (The code in my original email I guess conflates demonstration of two
issues - the addToIndexes() dance and the lack of variable-sized lists,
i.e. the Java collection support issue.  Even if you decide generic
collection / map support would be too tricky, at least supporting
variable-sized lists would help a lot...)

> >   * I wondered about storing (arbitrary) graphs in the CAS, but the
> >     issues above make this really impractical.  If you also think about
> >     integrating microformats, you need to think about how to do this.
> We have had users store arbitrary graphs in the CAS, but, yes, it is not so
> efficient.  The main element UIMA has for collections of references (to
> FeatureStructures) are the FSArray and FSList.  As you point out the FSArray is
> fixed length.  The FSList supports dynamic adding/removing etc. using the
> standard link-list technology.  However, because UIMA data in the CAS
> (currently) is not garbage collected, you have to be careful when using this
> technique.

  ...oh, never mind.  After using UIMA heavily for well over a year,
I managed not to learn that FSList exists at all!  Thanks for this
pointer.

  I think that's a bug for the UIMA Tutorial, which mentions FSArray but
not FSList.  :-)

  (Another pain point here - I always ache when I need to work with
FSArray or I guess FSList, since it does not carry the type information
that is in the typesystem - I need to manually typecast all the time
and hope I don't make a mistake.)

> The above proposal to allow the common Java Collection objects (like ArrayList,
> and Maps) as things in the CAS, plus garbage collection, should make it much more
> convenient to store and work with graphs in the CAS.
> >
> >   * Complex pipelines are a bit clumsy.  I think the biggest obvious
> >     problem is lack of signalling to CAS merger that input CASes have
> >     been exhausted.  Having an "isLast" barrier sounds simple as long
> >     as you have only a single CAS multiplier paired with the CAS merger,
> >     but when this assumption breaks down, things start to deteriorate.
> >     However, I realize complex pipelines are a niche area.
> It would be nice to hear some ideas here.

  (After reading Eddie Epstein's email and coming back to some more of
his emails to me, I realize that the isLast hack I'm using is unnecessary
if I instead use the "process-parent-last" flag of CASMultiplier.
I'm learning a lot from interacting here!  I guess that shows we could
always make use of more good UIMA code examples...)

-- 
				Petr Baudis
	If you have good ideas, good data and fast computers,
	you can do almost anything. -- Geoffrey Hinton

Re: UIMAj3 ideas

Posted by Marshall Schor <ms...@schor.com>.
On 7/9/2015 6:52 PM, Petr Baudis wrote:
<snip...>

https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3

>   I didn't figure out how to edit that wiki page, 
Due to spammers, we had to turn off public editing.  However, I can add you to a
list ( to do this, you have to "register" for a user id on the wiki, and then
send me offline what that Id is ), but even without being on the list, there's a
comment button which (I think) lets you add comments at the bottom.
> but a mental summary
> of the things I find currently irritating about UIMA and would love to
> see changed formed in my mind, so I thought I could contribute it for
> discussion.
Great!
>
>   * UIMAfit is not part of core UIMA and UIMA-AS is not part of core
>     UIMA.  It seems to me that UIMA-AS is doing things a bit differently
>     than what the original UIMA idea of doing scaleout was.  The two
>     things don't play well together.  I'd love a way to easily take
>     my plain UIMA pipeline and scale it out, ideally without any code
>     changes, *and* avoid the terrible XML config files.
Any specifics of what to change here would be helpful.  UIMA-AS was designed to
enable scale-out without changing the core UIMA pipeline or its XML
descriptor.  The additional information for UIMA-AS scaleout was put into a
separate xml descriptor which "embeds" the original plain UIMA one.

>
>   * Speaking of avoiding the config files, it'd be nice if I could avoid
>     them for type systems as well.  A radical idea: In the end, I treat
>     UIMA essentially as a storage for Java objects; I suspect many others
>     do the same.  I'd love a way to turn JCasGen on its head and write
>     the Java classes (possibly with some restrictions) that I could
>     store in UIMA, with the backend figuring out the low-level UIMA
>     representation on its own.  This would radically reduce some aspects
>     of the engineering overhead for me and maybe many other users.
Interesting idea.  I'll add it to the list.
>   * The JCas UIMA interface should be more transparent in other ways
>     too.  Working with arrays (and absence of lists) is a huge pain.
>     I just want to work with feature structures as if they were normal
>     Java objects, without major restrictions.
This is one of the version 3 ideas: see
https://cwiki.apache.org/confluence/display/UIMA/Supporting+Java+Collections+and+Maps+as+UIMA+Feature+Structures
>
>   * Connected with the above - I'd love .addToIndexes() to just
>     disappear.  Right now, the paradigm is that you build an annotation
>     in an annotator, and the moment it gets saved in a CAS, it becomes
>     basically read-only.  
You certainly can modify any of an Annotation's features subsequently.  I'm
guessing you're referring to another idea - adding additional features that were
not initially defined in the UIMA type system.  UIMA sets up the types and
features once at the start of the pipeline run (from a merge of all the
components' type systems), and locks down the type system.  Other frameworks
sometimes allow an unlocked type system, where you could add (after a Feature
Structure is created) additional features.  This is usually done by keeping a
list of feature-name <-> feature-value pairs (such as your code snippet does,
below).  We're thinking of including this capability in the version 3, with a
bit of a twist - the intent would be to keep the "compilable" aspect of
"locked-down" type/features (for high performance), while adding (for those use
cases that want it) the other style of dynamically added additional features (at
some cost in performance).  
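
The hybrid described above, locked-down features for speed plus optional dynamically added ones, might look roughly like this in plain Java. This is a speculative sketch of the idea only: HybridFeatureStructure and its methods are hypothetical and do not reflect any actual or planned UIMA v3 API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HybridFeatureStructure {
    private final List<String> declared;   // locked-down feature names
    private final Object[] slots;          // one slot per declared feature
    private Map<String, Object> extras;    // allocated only when needed

    HybridFeatureStructure(List<String> declaredFeatures) {
        this.declared = new ArrayList<>(declaredFeatures);
        this.slots = new Object[declared.size()];
    }

    void set(String feature, Object value) {
        // A real implementation would resolve this index ahead of time.
        int i = declared.indexOf(feature);
        if (i >= 0) {
            slots[i] = value;              // declared feature: fixed slot
        } else {
            if (extras == null) {
                extras = new HashMap<>();
            }
            extras.put(feature, value);    // undeclared feature: dynamic map
        }
    }

    Object get(String feature) {
        int i = declared.indexOf(feature);
        if (i >= 0) {
            return slots[i];
        }
        return extras == null ? null : extras.get(feature);
    }

    boolean hasDynamicFeatures() {
        return extras != null && !extras.isEmpty();
    }
}
```

Objects that never receive an undeclared feature pay only for the unused extras reference, which is one way to keep the extra cost limited to the use cases that want it.
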
> But if I want e.g. to build up a set of
>     features across multiple annotators, things again become very
>     painful.  Because arrays are also fixed-size, I need awful
>     boilerplate code like
>
>                 AnswerInfo ai = JCasUtil.selectSingle(jcas, AnswerInfo.class);
>                 AnswerFV fv = new AnswerFV(ai);
>                 fv.setFeature(f, 1.0);
>
>                 for (FeatureStructure af : ai.getFeatures().toArray())
>                         ((AnswerFeature) af).removeFromIndexes();
>                 ai.removeFromIndexes();
>
>                 ai.setFeatures(fv.toFSArray(jcas));
>                 ai.addToIndexes();
>
>     simply to add a feature.  (Note the AnswerFV class, which is the
>     actual thing I want to store in a JCas - a dynamic list of
>     (feature_label, feature_value) pairs - but to do that it ends
>     up being instead a complex factory of JCas FSes with a lot more
>     boilerplate code inside.  Also note the typecast.)
>
>   * I wondered about storing (arbitrary) graphs in the CAS, but the
>     issues above make this really impractical.  If you also think about
>     integrating microformats, you need to think about how to do this.
We have had users store arbitrary graphs in the CAS, but, yes, it is not so
efficient.  The main elements UIMA has for collections of references (to
FeatureStructures) are the FSArray and FSList.  As you point out the FSArray is
fixed length.  The FSList supports dynamic adding/removing etc. using the
standard linked-list approach.  However, because UIMA data in the CAS
(currently) is not garbage collected, you have to be careful when using this
technique.

The above proposal to allow the common Java Collection objects (like ArrayList,
and Maps) as things in the CAS, plus garbage collection, should make it much more
convenient to store and work with graphs in the CAS.
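
The contrast drawn above between a fixed-length FSArray and a linked FSList can be sketched in plain Java. An FSList is built from head/tail cells. The ConsList class below is a hypothetical stand-in that mirrors only that shape and is not one of the UIMA classes.

```java
import java.util.Arrays;

// Hypothetical head/tail list; null plays the role of the empty list.
final class ConsList<T> {
    final T head;
    final ConsList<T> tail;

    ConsList(T head, ConsList<T> tail) {
        this.head = head;
        this.tail = tail;
    }

    // Adding an element allocates one new cell and reuses the existing list.
    static <T> ConsList<T> push(T value, ConsList<T> list) {
        return new ConsList<>(value, list);
    }

    // Walk the cells to count them.
    static <T> int length(ConsList<T> list) {
        int n = 0;
        for (ConsList<T> c = list; c != null; c = c.tail) {
            n++;
        }
        return n;
    }

    // Fixed-size array counterpart: growing means copying into a new array.
    static String[] append(String[] array, String value) {
        String[] grown = Arrays.copyOf(array, array.length + 1);
        grown[array.length] = value;
        return grown;
    }
}
```

In a store without garbage collection, each replaced array or unlinked cell stays allocated, which is presumably the reason for the caution above.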
>
>   * Complex pipelines are a bit clumsy.  I think the biggest obvious
>     problem is lack of signalling to CAS merger that input CASes have
>     been exhausted.  Having an "isLast" barrier sounds simple as long
>     as you have only a single CAS multiplier paired with the CAS merger,
>     but when this assumption breaks down, things start to deteriorate.
>     However, I realize complex pipelines are a niche area.
It would be nice to hear some ideas here.
>
>   I think these are my main concerns.  I guess another way to phrase it:
> I came to UIMA looking for a way to generate, store and organize
> my+3rdparty Java object annotations of various text-based entities.
> It sort of delivers, but if I did this again, I'd seriously hesitate
> if the steep learning curve and incredible engineering overhead is worth
> the deal.  I want to suggest that UIMAj3 would make me not hesitate, and
> get out of my way! :)
Some of the other things we're thinking about are ways to get more out of the
way and integrate with other "popular" systems.  Any constructive thoughts here
are appreciated!

Thanks for your input.

-Marshall

Re: UIMAj3 ideas

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 16.07.2015, at 23:10, Jaroslaw Cwiklik <ui...@gmail.com> wrote:

> The UIMA-AS *does* have an API to generate deployment descriptors although
> its not documented. Its an internal API for now and most likely will be
> documented in the next release of UIMA-AS. The API is implemented by
> DeploymentDescriptorFactory.java. in the uimaj-as-core project.

Cool :) *thumbs up*

-- Richard

Re: Creating UIMA-AS deployment descriptors programmatically

Posted by Jaroslaw Cwiklik <ui...@gmail.com>.
Yes, I forgot about this. It's minimal documentation that describes
primitive deployment. More complex deployments are supported but not
documented. I think more work is needed to clean up the API, and once that
is done more documentation will be needed. This is work in progress.

-jerry

On Fri, Aug 12, 2016 at 10:30 AM, Richard Eckart de Castilho <rec@apache.org> wrote:

> It's in the documentation. That's how I stumbled over it again and tried
> to remember why back in the day I had written my own factory.
>
> https://uima.apache.org/d/uima-as-2.8.1/uima_async_scaleout.html#ref.async.api.descriptor.generation
>
> Cheers,
>
> -- Richard
>
> > On 12.08.2016, at 16:28, Jaroslaw Cwiklik <ui...@gmail.com> wrote:
> >
> > I think this is documented in the code only for now and not in the UIMA-AS
> > documentation. This API still needs work. I was thinking of changing this
> > to use Builder pattern to configure deployment using a series of set/add
> > calls instead of passing many parameters.
> > I can enhance the code to support your suggestion. I will create a new JIRA
> > to capture this requirement.
> > Thanks
> >
> > -jerry
> >
> >
> >
> > On Thu, Aug 11, 2016 at 1:26 PM, Richard Eckart de Castilho <rec@apache.org>
> > wrote:
> >
> >> On 16.07.2015, at 23:10, Jaroslaw Cwiklik <ui...@gmail.com> wrote:
> >>>
> >>> The UIMA-AS *does* have an API to generate deployment descriptors although
> >>> its not documented. Its an internal API for now and most likely will be
> >>> documented in the next release of UIMA-AS. The API is implemented by
> >>> DeploymentDescriptorFactory.java. in the uimaj-as-core project.
> >>
> >> I see this is documented now.
> >>
> >> Would be nice if one could directly set an AnalyisEngineDescriptor in the
> >> ServiceContextImpl instead of having to first serialize the AED to a file.
> >>
> >> Cheers,
> >>
> >> -- Richard
> >>
>
>

Re: Creating UIMA-AS deployment descriptors programmatically

Posted by Richard Eckart de Castilho <re...@apache.org>.
It's in the documentation. That's how I stumbled over it again and tried
to remember why back in the day I had written my own factory.

https://uima.apache.org/d/uima-as-2.8.1/uima_async_scaleout.html#ref.async.api.descriptor.generation

Cheers,

-- Richard

> On 12.08.2016, at 16:28, Jaroslaw Cwiklik <ui...@gmail.com> wrote:
> 
> I think this is documented in the code only for now and not in the UIMA-AS
> documentation. This API still needs work. I was thinking of changing this
> to use Builder pattern to configure deployment using a series of set/add
> calls instead of passing many parameters.
> I can enhance the code to support your suggestion. I will create a new JIRA
> to capture this requirement.
> Thanks
> 
> -jerry
> 
> 
> 
> On Thu, Aug 11, 2016 at 1:26 PM, Richard Eckart de Castilho <re...@apache.org>
> wrote:
> 
>> On 16.07.2015, at 23:10, Jaroslaw Cwiklik <ui...@gmail.com> wrote:
>>> 
>>> The UIMA-AS *does* have an API to generate deployment descriptors although
>>> its not documented. Its an internal API for now and most likely will be
>>> documented in the next release of UIMA-AS. The API is implemented by
>>> DeploymentDescriptorFactory.java. in the uimaj-as-core project.
>> 
>> I see this is documented now.
>> 
>> Would be nice if one could directly set an AnalyisEngineDescriptor in the
>> ServiceContextImpl instead of having to first serialize the AED to a file.
>> 
>> Cheers,
>> 
>> -- Richard
>> 


Re: Creating UIMA-AS deployment descriptors programmatically

Posted by Jaroslaw Cwiklik <ui...@gmail.com>.
I think this is documented in the code only for now and not in the UIMA-AS
documentation. This API still needs work. I was thinking of changing this
to use the Builder pattern to configure deployment using a series of set/add
calls instead of passing many parameters.
I can enhance the code to support your suggestion. I will create a new JIRA
to capture this requirement.
Thanks

-jerry



On Thu, Aug 11, 2016 at 1:26 PM, Richard Eckart de Castilho <re...@apache.org>
wrote:

> On 16.07.2015, at 23:10, Jaroslaw Cwiklik <ui...@gmail.com> wrote:
> >
> > The UIMA-AS *does* have an API to generate deployment descriptors although
> > its not documented. Its an internal API for now and most likely will be
> > documented in the next release of UIMA-AS. The API is implemented by
> > DeploymentDescriptorFactory.java. in the uimaj-as-core project.
>
> I see this is documented now.
>
> Would be nice if one could directly set an AnalyisEngineDescriptor in the
> ServiceContextImpl instead of having to first serialize the AED to a file.
>
> Cheers,
>
> -- Richard
>

Creating UIMA-AS deployment descriptors programmatically

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 16.07.2015, at 23:10, Jaroslaw Cwiklik <ui...@gmail.com> wrote:
> 
> The UIMA-AS *does* have an API to generate deployment descriptors although
> its not documented. Its an internal API for now and most likely will be
> documented in the next release of UIMA-AS. The API is implemented by
> DeploymentDescriptorFactory.java. in the uimaj-as-core project.

I see this is documented now.

Would be nice if one could directly set an AnalyisEngineDescriptor in the
ServiceContextImpl instead of having to first serialize the AED to a file.

Cheers,

-- Richard

Re: UIMAj3 ideas

Posted by Jaroslaw Cwiklik <ui...@gmail.com>.
The UIMA-AS *does* have an API to generate deployment descriptors although
it's not documented. It's an internal API for now and most likely will be
documented in the next release of UIMA-AS. The API is implemented by
DeploymentDescriptorFactory.java in the uimaj-as-core project.

Jerry

On Thu, Jul 16, 2015 at 4:56 PM, Thomas Ginter <th...@utah.edu>
wrote:

> Richard,
>
> There is an API in UIMA for generating Analysis Engine Descriptors as well
> as Aggregates and Type System descriptions.  I use that API to generate the
> xml descriptor at runtime after the configuration has been completed.  I
> wrote my own logic to track the delegates of an Aggregate descriptor in
> order to propagate updates to/from delegates to allow the user to
> dynamically specify Analysis Engine parameters.  I also merged the scale
> out parameters for UIMA-AS into the Analysis Engine object for ease of
> configuration.
>
> In addition I wrote my own code to generate the deployment descriptor from
> the programmatic parameters provided.  The resulting XML is what the
> framework uses to generate the Spring Bean file you mentioned.
>
> That being said the existing API definitely has a learning curve which was
> part of the motivation for creating Leo.
>
> Thanks,
>
> Thomas Ginter
> 801-448-7676
> thomas.ginter@utah.edu
>
>
>
>
> > On Jul 16, 2015, at 1:51 PM, Richard Eckart de Castilho <re...@apache.org>
> wrote:
> >
> > Hi Thomas,
> >
> > On 16.07.2015, at 21:42, Thomas Ginter <th...@utah.edu> wrote:
> >
> >> Have you looked into using Leo?  It allows you to programmatically
> create Analysis Engines, Aggregates, the type system, and launch everything
> in UIMA-AS without having to manage any XML descriptors at all.
> Furthermore it is available via Maven so your code can compile and run.
> >
> > Did you find an API in UIMA AS to handle the programmatic generation of
> descriptors, or did you implement that yourself in Leo (as I had tried to
> in DKPro Lab)?
> >
> > If I remember correctly, then UIMA AS loaded plain XML descriptor files,
> transforms them to a Spring Bean file using XSLT and then used Spring to
> instantiate it. But I may have missed something.
> >
> > Cheers,
> >
> > -- Richard
>
>

Re: UIMAj3 ideas

Posted by Richard Eckart de Castilho <re...@apache.org>.
Thomas,

On 16.07.2015, at 22:56, Thomas Ginter <th...@utah.edu> wrote:

> There is an API in UIMA for generating Analysis Engine Descriptors as well as Aggregates and Type System descriptions.  I use that API to generate the xml descriptor at runtime after the configuration has been completed.  I wrote my own logic to track the delegates of an Aggregate descriptor in order to propagate updates to/from delegates to allow the user to dynamically specify Analysis Engine parameters.  I also merged the scale out parameters for UIMA-AS into the Analysis Engine object for ease of configuration.  

we're using the plain UIMA APIs for AED and friends in uimaFIT too - those APIs being not too user-friendly and XML being a pain was the major motivation to come up with uimaFIT. However, uimaFIT doesn't aspire to drive UIMA AS, just to make the core UIMA descriptors easier to handle.

> In addition I wrote my own code to generate the deployment descriptor from the programmatic parameters provided.  The resulting XML is what the framework uses to generate the Spring Bean file you mentioned.


So what you say confirms my findings. I never found a corresponding API for UIMA deployment descriptors in UIMA AS. It would have been great if UIMA AS had provided at least some basic API for deployment descriptors parallel to what UIMA offers for engines and aggregates.

> That being said the existing API definitely has a learning curve which was part of the motivation for creating Leo.

Same for uimaFIT ;) 

Cheers,

-- Richard

Re: UIMAj3 ideas

Posted by Thomas Ginter <th...@utah.edu>.
Richard,

There is an API in UIMA for generating Analysis Engine Descriptors as well as Aggregates and Type System descriptions.  I use that API to generate the xml descriptor at runtime after the configuration has been completed.  I wrote my own logic to track the delegates of an Aggregate descriptor in order to propagate updates to/from delegates to allow the user to dynamically specify Analysis Engine parameters.  I also merged the scale out parameters for UIMA-AS into the Analysis Engine object for ease of configuration.  

In addition I wrote my own code to generate the deployment descriptor from the programmatic parameters provided.  The resulting XML is what the framework uses to generate the Spring Bean file you mentioned.

That being said the existing API definitely has a learning curve which was part of the motivation for creating Leo.

Thanks,

Thomas Ginter
801-448-7676
thomas.ginter@utah.edu




> On Jul 16, 2015, at 1:51 PM, Richard Eckart de Castilho <re...@apache.org> wrote:
> 
> Hi Thomas,
> 
> On 16.07.2015, at 21:42, Thomas Ginter <th...@utah.edu> wrote:
> 
>> Have you looked into using Leo?  It allows you to programmatically create Analysis Engines, Aggregates, the type system, and launch everything in UIMA-AS without having to manage any XML descriptors at all.  Furthermore it is available via Maven so your code can compile and run.
> 
> Did you find an API in UIMA AS to handle the programmatic generation of descriptors, or did you implement that yourself in Leo (as I had tried to in DKPro Lab)? 
> 
> If I remember correctly, then UIMA AS loaded plain XML descriptor files, transforms them to a Spring Bean file using XSLT and then used Spring to instantiate it. But I may have missed something.
> 
> Cheers,
> 
> -- Richard


Re: UIMAj3 ideas

Posted by Richard Eckart de Castilho <re...@apache.org>.
Hi Thomas,

On 16.07.2015, at 21:42, Thomas Ginter <th...@utah.edu> wrote:

> Have you looked into using Leo?  It allows you to programmatically create Analysis Engines, Aggregates, the type system, and launch everything in UIMA-AS without having to manage any XML descriptors at all.  Furthermore it is available via Maven so your code can compile and run.

Did you find an API in UIMA AS to handle the programmatic generation of descriptors, or did you implement that yourself in Leo (as I had tried to in DKPro Lab)? 

If I remember correctly, then UIMA AS loaded plain XML descriptor files, transforms them to a Spring Bean file using XSLT and then used Spring to instantiate it. But I may have missed something.
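For readers unfamiliar with that kind of flow, here is a minimal sketch of an XSLT step that turns a descriptor into a Spring-style beans file, using only the JDK's javax.xml.transform API. The stylesheet, the element and attribute names (deployment, service, queue) and the class example.ServiceBean are all invented for illustration; this is not the stylesheet or descriptor format that UIMA AS actually uses.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class DescriptorToBeansSketch {

    // Hypothetical stylesheet: every element/attribute name here is an assumption.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/deployment\">"
      + "<beans><xsl:for-each select=\"service\">"
      + "<bean id=\"{@name}\" class=\"example.ServiceBean\">"
      + "<property name=\"queue\" value=\"{@queue}\"/>"
      + "</bean></xsl:for-each></beans>"
      + "</xsl:template></xsl:stylesheet>";

    /** Applies the stylesheet to a descriptor given as an XML string. */
    public static String toBeans(String descriptorXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(descriptorXml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrap the checked exception so callers need no throws clause.
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

Calling toBeans on a one-service descriptor yields a beans document with one bean per service element; a dependency-injection container would then instantiate the objects from that file.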

Cheers,

-- Richard 

Re: UIMAj3 ideas

Posted by Petr Baudis <pa...@ucw.cz>.
  Hi!

On Thu, Jul 16, 2015 at 07:42:58PM +0000, Thomas Ginter wrote:
> Have you looked into using Leo?  It allows you to programmatically create Analysis Engines, Aggregates, the type system, and launch everything in UIMA-AS without having to manage any XML descriptors at all.  Furthermore it is available via Maven so your code can compile and run.
> 
> http://department-of-veterans-affairs.github.io/Leo/userguide.html

  I had a look, but got the impression that I'd have to rewrite most
of my pipeline generation code, and it's not small code.  Also, it's
not clear to me from Leo's docs whether or how it supports CAS
multipliers and mergers; there seem to be no references to that.

  This impression might have been wrong, but overall I'd just welcome
being able to stick with stock UIMA for scaleout, at least in the form
of multi-threading without cluster scaleout (which I think many UIMA
users would welcome, while a much smaller percentage wants to deploy to
a cluster); that's what I was trying to say originally.
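
  As a rough illustration of multi-threading-only scaleout, here is a minimal sketch in plain Java. It is not UIMA API: the class name, the processAll method and the Function<String, String> standing in for a pipeline are all hypothetical. It assumes the pipeline itself is single-threaded, so each worker thread builds and reuses its own instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.function.Supplier;

public class ThreadedScaleoutSketch {

    /**
     * Runs every input through a pipeline on a fixed-size thread pool.
     * The factory is called at most once per worker thread, so pipeline
     * instances are never shared between threads.
     */
    public static List<String> processAll(
            List<String> inputs,
            Supplier<Function<String, String>> pipelineFactory,
            int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // One (hypothetical) pipeline instance per worker thread.
        ThreadLocal<Function<String, String>> perThread =
            ThreadLocal.withInitial(pipelineFactory);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String input : inputs) {
                futures.add(pool.submit(() -> perThread.get().apply(input)));
            }
            // Collect results in submission order.
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Pipeline failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

  For work that mostly waits on external services, the thread count can reasonably exceed the number of cores; the right value depends on the workload.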

-- 
				Petr Baudis
	If you have good ideas, good data and fast computers,
	you can do almost anything. -- Geoffrey Hinton

Re: UIMAj3 ideas

Posted by Thomas Ginter <th...@utah.edu>.
Hi Petr,

Have you looked into using Leo?  It allows you to programmatically create Analysis Engines, Aggregates, the type system, and launch everything in UIMA-AS without having to manage any XML descriptors at all.  Furthermore it is available via Maven so your code can compile and run.

http://department-of-veterans-affairs.github.io/Leo/userguide.html

The only catch to running UIMA-AS is making sure the broker is running, a manual step that we have not yet automated.  Other than that it can scale most pipelines, with the notable exception of pipelines that have really large resources.

As for ideas for UIMA 3, I would love to see a much simpler CAS system that didn’t require a pre-definition of types before execution, such as a very simple abstract base class that defines an “annotation” and is then extended in order to create/use a new type.  It seems like the basic location-based indexes could still be provided that way, as well as the option of extending to provide custom indexes.  If the CAS were implemented as a base set of very simple Java objects, we would also have more serialization options, possibly even making it possible for the user to plug in a different serializer if required, such as protobuf.  Just a thought.
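
A minimal sketch of what such a design might look like, in plain Java. None of this is existing UIMA API: the class names, fields and the ordering rule (begin ascending, then end descending) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SimpleCasSketch {

    /** Hypothetical base class: an annotation is just a span of text. */
    public abstract static class Annotation {
        public final int begin;
        public final int end;

        protected Annotation(int begin, int end) {
            if (begin < 0 || end < begin) {
                throw new IllegalArgumentException("invalid span");
            }
            this.begin = begin;
            this.end = end;
        }
    }

    /** A new type is defined by extending the base class, nothing else. */
    public static final class Token extends Annotation {
        public final String pos;

        public Token(int begin, int end, String pos) {
            super(begin, end);
            this.pos = pos;
        }
    }

    public static final class Sentence extends Annotation {
        public Sentence(int begin, int end) {
            super(begin, end);
        }
    }

    /** Location-based index: begin ascending, longer spans first on ties. */
    public static final class PositionIndex {
        private static final Comparator<Annotation> ORDER =
            Comparator.comparingInt((Annotation a) -> a.begin)
                .thenComparing(
                    Comparator.comparingInt((Annotation a) -> a.end).reversed());

        private final List<Annotation> items = new ArrayList<>();

        public void add(Annotation annotation) {
            items.add(annotation);
            items.sort(ORDER); // simple, not efficient; fine for a sketch
        }

        /** All annotations of the given type (or a subtype), in index order. */
        public <T extends Annotation> List<T> select(Class<T> type) {
            List<T> out = new ArrayList<>();
            for (Annotation a : items) {
                if (type.isInstance(a)) {
                    out.add(type.cast(a));
                }
            }
            return out;
        }

        /** Annotations of the given type that lie inside the covering span. */
        public <T extends Annotation> List<T> coveredBy(Class<T> type, Annotation cover) {
            List<T> out = new ArrayList<>();
            for (T a : select(type)) {
                if (a != cover && a.begin >= cover.begin && a.end <= cover.end) {
                    out.add(a);
                }
            }
            return out;
        }
    }
}
```

Because the annotations here are ordinary Java objects, any general-purpose serializer could in principle be applied to them; a custom index would be another class alongside PositionIndex.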

Thanks,

Thomas Ginter
801-448-7676
thomas.ginter@utah.edu




> On Jul 16, 2015, at 10:25 AM, Petr Baudis <pa...@ucw.cz> wrote:
> 
>  Hi!
> 
> On Fri, Jul 10, 2015 at 10:28:08AM -0400, Eddie Epstein wrote:
>> Good comments which will likely generate lots of responses.
>> For now please see comments on scaleout below.
>> 
>> On Thu, Jul 9, 2015 at 6:52 PM, Petr Baudis <pa...@ucw.cz> wrote:
>> 
>>>  * UIMAfit is not part of core UIMA and UIMA-AS is not part of core
>>>    UIMA.  It seems to me that UIMA-AS is doing things a bit differently
>>>    than what the original UIMA idea of doing scaleout was.  The two
>>>    things don't play well together.  I'd love a way to easily take
>>>    my plain UIMA pipeline and scale it out, ideally without any code
>>>    changes, *and* avoid the terrible XML config files.
>>> 
>>> 
>> Not clear what you are referring to as the "original UIMA idea of doing
>> scaleout",
>> the CPE? Core UIMA is a single threaded, embeddable framework. UIMA-AS
>> is also an embeddable framework that offers flexible vertical
>> (multi-threading) and
>> horizontal (multi-process) options for deploying an arbitrary pipeline.
>> Admittedly
>> scaleout with UIMA-AS is complicated and the minimal support for process
>> management make it difficult to do scaleout simply. In what ways do you
>> think
>> UIMA-AS is inconsistent with UIMA or UIMA scaleout?
> 
>  Well, my impression after delving into some UIMA internals was that
> the original idea was to use the Analysis Structure Broker to control
> the pipeline flow and it would seem natural that when doing scale-out,
> one would simply provide a different ASB.  Its javadoc even reads
> 
>> The Analysis Structure Broker (<code>ASB</code>) is the component
>> responsible for the details of communicating with Analysis Engines
>> that may potentially be distributed across different physical
>> machines.
> 
> Of course, maybe I got it wrong.
> 
>> DUCC is full cluster management application that will scaleout a plain UIMA
>> pipeline with no code changes, assuming that the application code is
>> threadsafe.
>> But a typical pipeline with a single collection reader creating input CASes
>> and
>> a single cas consumer will limit scaleout performance pretty quickly. DUCC
>> makes it easy to eliminate the input data bottleneck. DUCC sample apps
>> show one approach to eliminating the output bottleneck. Have you looked at
>> DUCC?
> 
>  I use UIMA pipeline for question answering, where each question
> currently takes ~30s (single-threaded) to process (a lot of it spent
> waiting on databases), so I don't think I'd hit such a bottleneck.
> I did spend a few tens of minutes looking at DUCC, but I got the
> impression that it's not really trivial to set up.
> 
>  One of my goals is to minimize setup hassles for anyone who wants to
> run my software - ideally, they should be able to just compile and run.
> If I started to use DUCC, I'm not sure to what degree I could preserve
> this, but at least it's another element in the already steep learning
> curve for anyone who wants to tinker with the system.
> 
>  (Then there's this whole issue of UIMA-AS vs. UIMAfit and in-memory
> resource sharing - though from one of your previous emails, I got the
> impression that I could run multiple AEs in threads of a single java
> process; but I guess at that point I was already decided that I want
> to try something less complex.)
> 
> -- 
> 				Petr Baudis
> 	If you have good ideas, good data and fast computers,
> 	you can do almost anything. -- Geoffrey Hinton


Re: UIMAj3 ideas

Posted by Petr Baudis <pa...@ucw.cz>.
  Hi!

On Fri, Jul 10, 2015 at 10:28:08AM -0400, Eddie Epstein wrote:
> Good comments which will likely generate lots of responses.
> For now please see comments on scaleout below.
> 
> On Thu, Jul 9, 2015 at 6:52 PM, Petr Baudis <pa...@ucw.cz> wrote:
> 
> >   * UIMAfit is not part of core UIMA and UIMA-AS is not part of core
> >     UIMA.  It seems to me that UIMA-AS is doing things a bit differently
> >     than what the original UIMA idea of doing scaleout was.  The two
> >     things don't play well together.  I'd love a way to easily take
> >     my plain UIMA pipeline and scale it out, ideally without any code
> >     changes, *and* avoid the terrible XML config files.
> >
> >
> Not clear what you are referring to as the "original UIMA idea of doing
> scaleout",
> the CPE? Core UIMA is a single threaded, embeddable framework. UIMA-AS
> is also an embeddable framework that offers flexible vertical
> (multi-threading) and
> horizontal (multi-process) options for deploying an arbitrary pipeline.
> Admittedly
> scaleout with UIMA-AS is complicated and the minimal support for process
> management make it difficult to do scaleout simply. In what ways do you
> think
> UIMA-AS is inconsistent with UIMA or UIMA scaleout?

  Well, my impression after delving into some UIMA internals was that
the original idea was to use the Analysis Structure Broker to control
the pipeline flow and it would seem natural that when doing scale-out,
one would simply provide a different ASB.  Its javadoc even reads

> The Analysis Structure Broker (<code>ASB</code>) is the component
> responsible for the details of communicating with Analysis Engines
> that may potentially be distributed across different physical
> machines.

Of course, maybe I got it wrong.

> DUCC is full cluster management application that will scaleout a plain UIMA
> pipeline with no code changes, assuming that the application code is
> threadsafe.
> But a typical pipeline with a single collection reader creating input CASes
> and
> a single cas consumer will limit scaleout performance pretty quickly. DUCC
> makes it easy to eliminate the input data bottleneck. DUCC sample apps
> show one approach to eliminating the output bottleneck. Have you looked at
> DUCC?

  I use a UIMA pipeline for question answering, where each question
currently takes ~30s (single-threaded) to process (a lot of it spent
waiting on databases), so I don't think I'd hit such a bottleneck.
I did spend a few tens of minutes looking at DUCC, but I got the
impression that it's not really trivial to set up.

  One of my goals is to minimize setup hassles for anyone who wants to
run my software - ideally, they should be able to just compile and run.
If I started to use DUCC, I'm not sure to what degree I could preserve
this, but at least it's another element in the already steep learning
curve for anyone who wants to tinker with the system.

  (Then there's this whole issue of UIMA-AS vs. UIMAfit and in-memory
resource sharing - though from one of your previous emails, I got the
impression that I could run multiple AEs in threads of a single java
process; but I guess at that point I had already decided that I wanted
to try something less complex.)

-- 
				Petr Baudis
	If you have good ideas, good data and fast computers,
	you can do almost anything. -- Geoffrey Hinton

Re: UIMAj3 ideas

Posted by Eddie Epstein <ea...@gmail.com>.
Hi Petr,

Good comments which will likely generate lots of responses.
For now please see comments on scaleout below.

On Thu, Jul 9, 2015 at 6:52 PM, Petr Baudis <pa...@ucw.cz> wrote:

>   * UIMAfit is not part of core UIMA and UIMA-AS is not part of core
>     UIMA.  It seems to me that UIMA-AS is doing things a bit differently
>     than what the original UIMA idea of doing scaleout was.  The two
>     things don't play well together.  I'd love a way to easily take
>     my plain UIMA pipeline and scale it out, ideally without any code
>     changes, *and* avoid the terrible XML config files.
>
>
Not clear what you are referring to as the "original UIMA idea of doing
scaleout", the CPE? Core UIMA is a single-threaded, embeddable framework.
UIMA-AS is also an embeddable framework that offers flexible vertical
(multi-threading) and horizontal (multi-process) options for deploying an
arbitrary pipeline. Admittedly, scaleout with UIMA-AS is complicated, and
the minimal support for process management makes it difficult to do
scaleout simply. In what ways do you think UIMA-AS is inconsistent with
UIMA or UIMA scaleout?

DUCC is a full cluster management application that will scale out a plain
UIMA pipeline with no code changes, assuming that the application code is
threadsafe. But a typical pipeline with a single collection reader creating
input CASes and a single CAS consumer will limit scaleout performance pretty
quickly. DUCC makes it easy to eliminate the input data bottleneck. DUCC
sample apps show one approach to eliminating the output bottleneck. Have you
looked at DUCC?

Regards,
Eddie