Posted to dev@uima.apache.org by Joern Kottmann <ko...@gmail.com> on 2015/07/23 14:43:09 UTC

Re: Ideas for UIMA v3

On Wed, 2015-06-24 at 14:31 +0200, Thilo Goetz wrote:
> > Marshall already did some nice work on JSON serialization, so I
> think there is movement into that direction.
>
> Just to be very clear: that is not good enough. I want a JSON format
> that I can read and write without the help of the framework. From my
> datastructures, into my datastructures. In some programming language
> that hasn't been invented yet. Simple enough that I don't need to
> absorb
> and reimplement the whole UIMA philosophy.
>
> >
> > But what I don't understand is how a data format resolves to "less
> framework". The data format is basically addressing ingestion and
> export, but not processing or pipelines. Even if you have a simple
> data format like JSON, there's still the need to run analysis, right?
> Is the analysis in your scenario just a black box? And in order to
> apply the analysis, you'll need some kind API - how do you imagine it?
>
> The analysis is a black box, yes. What else could it be? I don't care
> how the POS tagger does what it does. All I'm interested in is what
> it
> needs as input, and how it gives me the output. I can parse JSON into
> Java pojos with jackson for example, that's super simple. Writing
> them
> out is even easier. What APIs do I need other than being able to tell
> some piece of analysis to do its stuff on a bunch of data?

One thing which must have been overlooked when UIMA was built is that
people (like me) have to write code which wants to interact with the CAS
but can't be an AE. In UIMA the CAS (either in memory or serialized)
is difficult to use without implementing an AE. In those scenarios you
usually have to deal with some kind of serialized CAS anyway. Today, it
is really easy to serialize a CAS into XMI, but that format is not
trivial to deal with at all.

And if you would like to interact with it in a different programming
language the entry barrier is so high that I have never seen it anywhere
(except our C++ layer). It is probably easier to build something similar
for that particular use case.

Here, JSON would really help to be more compatible with different environments.
Reading, modifying and adding objects to a JSON structure can be done
in most programming languages without much overhead (if the structure
is not too complex).
Sometimes there is even direct support for JSON, e.g. in ElasticSearch
or browsers.
And soon also in Java.
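
To illustrate how little is needed to write such a structure without any
framework, here is a minimal sketch in plain Java. The layout (a "text" field
plus an "annotations" array with "type", "begin" and "end") and all class and
method names are assumptions made up for this example; it is not the JSON
format produced by UIMA's JSON serialization.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, framework-free document model; not UIMA's JSON CAS format. */
final class PlainJsonSketch {

    /** One annotation: a type name plus character offsets into the text. */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Wraps a string in quotes and escapes what JSON requires to be escaped. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \ uXXXX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Writes the text and its annotations as one JSON object. */
    static String toJson(String text, List<Span> spans) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"text\":").append(quote(text)).append(",\"annotations\":[");
        for (int i = 0; i < spans.size(); i++) {
            Span s = spans.get(i);
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"type\":").append(quote(s.type))
              .append(",\"begin\":").append(s.begin)
              .append(",\"end\":").append(s.end).append('}');
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        List<Span> spans = new ArrayList<>();
        spans.add(new Span("Person", 0, 5));
        System.out.println(toJson("Joern wrote this.", spans));
    }
}
```

Anything that can parse JSON can read this back without knowing about type
systems or indexes, which is the kind of low entry barrier meant above.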

It should be much easier to serialize/deserialize a CAS.
The best practice today is to implement an AE to achieve that, but
that again is not nice when I don't want to deal with AEs.

An AE is great to add structure to a document. After that is done there
is often code which works on that structured data. That could be a MapReduce
job that is counting the number of tokens in a document collection.

In those cases it would be really really nice to just create/deserialize a CAS
and program against the CAS instead of rebuilding the parts of it that
are needed, e.g. iterating Person annotations in the order in which
they occur in the text, only iterating tokens inside a sentence, etc.
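
As a sketch of the two queries mentioned above, the following plain-Java
example sorts spans into text order and selects the tokens inside a sentence.
The Span class, the type names and the ordering rule (begin ascending, then
end descending) are assumptions for illustration; it shows the kind of code
that ends up being rebuilt by hand, not the CAS API itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical plain-Java spans; not the UIMA CAS API. */
final class SpanQueriesSketch {

    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Returns all spans of the given type in text order. */
    static List<Span> ofType(List<Span> all, String type) {
        List<Span> out = new ArrayList<>();
        for (Span s : all) {
            if (s.type.equals(type)) {
                out.add(s);
            }
        }
        // Earlier begin first; for equal begins the longer span comes first.
        out.sort((a, b) -> a.begin != b.begin
                ? Integer.compare(a.begin, b.begin)
                : Integer.compare(b.end, a.end));
        return out;
    }

    /** Returns spans of the given type that lie fully inside the cover span. */
    static List<Span> coveredBy(List<Span> all, String type, Span cover) {
        List<Span> out = new ArrayList<>();
        for (Span s : ofType(all, type)) {
            if (s.begin >= cover.begin && s.end <= cover.end) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Span sentence = new Span("Sentence", 0, 10);
        List<Span> all = Arrays.asList(
                sentence,
                new Span("Token", 11, 15),
                new Span("Token", 5, 10),
                new Span("Token", 0, 4));
        for (Span token : coveredBy(all, "Token", sentence)) {
            System.out.println(token.begin + "-" + token.end);
        }
    }
}
```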

The CAS is also not flexible enough when it comes to the really simple cases.
Maybe I just want to process only one FeatureStructure per CAS with an
AE I already built.
Some of my AEs only work on higher level FSes, like a Person Entity.
Why is there so much overhead in creating a CAS with just one FS?

And today it is cumbersome to work with it in Java. The CAS interface
doesn't allow me to use POJOs and JCas is too complex (e.g. code generation).

For UIMA v3 I really hope that we can rebuild the CAS so that it looks like
something that was built today and not 15 years ago.

Jörn

Re: Ideas for UIMA v3

Posted by Joern Kottmann <ko...@gmail.com>.
I think it would be very valuable to collect and write down a couple of
user stories to see where people have problems using UIMA, and maybe also
stories about how they would like to use it. We can also use these stories
to make design decisions for v3.

If people are just holding it wrong there is probably no reason to even
make v3.

Jörn

On Thu, Jul 23, 2015 at 3:33 PM, Marshall Schor <ms...@schor.com> wrote:

>
>
> On 7/23/2015 8:55 AM, Richard Eckart de Castilho wrote:
> > On 23.07.2015, at 14:43, Joern Kottmann <ko...@gmail.com> wrote:
> >
> >> One thing which must have been overlooked when UIMA was built is that
> >> people (like me) have to write code which wants to interact with the CAS
> >> but can't be an AE. In UIMA the CAS (either in memory, or serialized)
> >> is difficult to
> >> be used without implementing an AE.
> > I'm not sure why you feel like that. E.g. in WebAnno (an annotation editor
> > that uses the CAS as its internal data model), we operate with the CAS
> > basically without any AEs. All editing operations are done directly on the
> > CAS which is loaded/saved directly using the UIMA API calls for binary
> > serialization.
>
> One possible reason why people feel this way might be that we're missing some
> entertaining and compelling stories in multiple kinds of media that explain how
> to do the kinds of things, easily, that people say are hard.  For example,
> there's little written on the new JSON serialization that gives lots of
> examples, including the super-simple varieties that are possible (i.e.
> omitting the context info).  Volunteers wanted/needed to write this! :-)
>
> -Marshall
> >
> > Basically, we are using the same API that we would be using in an AE, but without
> > the AE/pipelining stuff. It doesn't get any more difficult without the AE - in fact
> > some things become easier without AEs, readers, and consumers.
> >
> > I'm sure you must have something similar in the CAS Editor plugin in Eclipse, no?
> >
> > -- Richard
>
>

Re: Ideas for UIMA v3

Posted by Marshall Schor <ms...@schor.com>.

On 7/23/2015 8:55 AM, Richard Eckart de Castilho wrote:
> On 23.07.2015, at 14:43, Joern Kottmann <ko...@gmail.com> wrote:
>
>> One thing which must have been overlooked when UIMA was built is that
>> people (like me) have to write code which wants to interact with the CAS
>> but can't be an AE. In UIMA the CAS (either in memory, or serialized)
>> is difficult to
>> be used without implementing an AE.
> I'm not sure why you feel like that. E.g. in WebAnno (an annotation editor
> that uses the CAS as its internal data model), we operate with the CAS
> basically without any AEs. All editing operations are done directly on the
> CAS which is loaded/saved directly using the UIMA API calls for binary 
> serialization.

One possible reason why people feel this way might be that we're missing some
entertaining and compelling stories in multiple kinds of media that explain how
to do the kinds of things, easily, that people say are hard.  For example,
there's little written on the new JSON serialization that gives lots of
examples, including the super-simple varieties that are possible (i.e. omitting
the context info).  Volunteers wanted/needed to write this! :-)

-Marshall
>
> Basically, we are using the same API that we would be using in an AE, but without
> the AE/pipelining stuff. It doesn't get any more difficult without the AE - in fact
> some things become easier without AEs, readers, and consumers. 
>
> I'm sure you must have something similar in the CAS Editor plugin in Eclipse, no?
>
> -- Richard


Re: Ideas for UIMA v3

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 23.07.2015, at 19:17, Joern Kottmann <ko...@gmail.com> wrote:

> With my CAS-like thing I can just write cas.getIndex("index37",
> EmailAddressAnnotation.class) and it just returns them as Java objects
> of type EmailAddressAnnotation. 
> 
> In a different place in the system the code might only assume it's an
> annotation and retrieves the same index as objects of type AnnotationFS.
> cas.getIndex("index37", AnnotationFS.class)

This sounds like the kind of API that uimaFIT supports on top of uimaj-core in JCasUtil:

for (EmailAddressAnnotation a : select(jcas, EmailAddressAnnotation.class)) {
  ...
}

for (Annotation a : select(jcas, Annotation.class)) {
  ...
}

The recent changes in uimaj-core allow almost the same now:


for (EmailAddressAnnotation t : jcas.getAnnotationIndex(EmailAddressAnnotation.class)) {
  ...
}

However, as far as I can see, neither uimaFIT nor UIMA currently provides this kind of API for custom indexes - should be easy to add.

> UIMA doesn't make it easy with its static type system to write generic
> code working with CASes.

It is not possible (without hacks) to add features or types during pipeline execution. But since you appear to be working with pre-defined (Java) classes (similar to JCas wrappers), I don't see a problem. You have to know your Java classes at build time, so why is it not possible to know the typesystem at build time?

-- Richard

Re: Ideas for UIMA v3

Posted by Michael Tanenblatt <sl...@park-slope.net>.
> On Jul 24, 2015, at 8:08 AM, Joern Kottmann <ko...@gmail.com> wrote:
> 
> On Fri, Jul 24, 2015 at 12:46 PM, Richard Eckart de Castilho <rec@apache.org> wrote:
> 
>> On 23.07.2015, at 19:17, Joern Kottmann <ko...@gmail.com> wrote:
>> 
>>>> If this is the scenario, another option would be to have the serialized
>> CASes
>>>> stored along with a reference to their type system, and have some new
>>>> deserialization capability be able to locate the referred-to type
>> system along
>>>> with the CAS to be read in.  Would that "solve" this issue, or are
>> there other
>>>> aspects?
>> 
>> https://issues.apache.org/jira/browse/UIMA-2127 ;)
>> 
>> But having the TS stored alongside the CAS also is nice - see below.
>> 
>>> It would probably solve it, but it is not a simple solution either. That
>>> would mean that the Type System gets switched frequently and has to be
>>> looked up all the time.
>> 
>> For DKPro Core, I have implemented a BinaryCasWriter that stores the type
>> system in the same file as the binary serialized CAS. It is not always the
>> best solution because it adds a fixed overhead to every file, but it is
>> very convenient. Optionally, the type system can be stored externally in a
>> separate file to avoid this overhead. If and how this typesystem can be
>> used depends on which of the six kinds of binary serialization is being
>> used. See [1] for an overview over these formats and their properties.
>> 
>> 
> We have a few hundred million documents in the system, storing the ts with
> each document would be wasteful. It needs storage and it has to be parsed
> for each CAS.
> 
> 
> 
>> In the BinaryCasReader, depending on the type of serialization, either:
>> - there is a failure if the pipeline CAS typesystem is not compatible with
>> the persisted CAS;
>> - the type system in the pipeline CAS is reinitialized from the persisted
>> CAS;
>> - the data from the persisted CAS is loaded leniently, dropping all FSes
>> that are not defined in the pipeline CAS typesystem
>> 
>> Furthermore, the BinaryCasReader auto-detects the binary format and loads
>> it, be it the Java serialization-based format or one of the binary formats
>> that Marshall recently created, or our extended format that also embeds
>> the typesystem in the file.
>> 
>> Mind that depending on the use-case a different kind of serialization may
>> be appropriate.
>> 
>> For me, this covers in particular the following use-cases:
>> 
>> - fast (de)serialization of the entire CAS
>> - compact binary format (some more some less)
>> - stable FS addresses (in some formats)
>> - restoring the pipeline CAS type system from file (i.e. CAS can be
>> initialized with an empty type system on creation and TS is set by reader -
>> in some formats)
>> - lenient loading of data allowing for different TSes on disk and in
>> pipeline (in some formats)
>> 
>> Would such an approach cover (some of your) use-cases?
>> 
> 
> 
> With the current design the best option is probably to store a type system
> id with the document.



Agreed with this—where the type system ID is a URI


> It would be nice to avoid that additional complexity.
> 
> I think I have mainly two cases I can't really deal with:
> - A CAS contains FSes of many types. I know a few of those types and would
> like to only work with them. Not interested at all in the FSes with other
> types.
> - A CAS contains FSes of many types. I just want to deal with them as if
> they have a certain super-type. That could be FeatureStructure or
> AnnotationFS.
> 
> The CASes above have been produced by many different AAEs with similar, but
> slightly different type systems.

Right, those are typical issues that people will commonly need to surmount, particularly the second, where the super-type is some relatively generic type (e.g., Token).

..m

Re: Ideas for UIMA v3

Posted by Joern Kottmann <ko...@gmail.com>.
On Fri, Jul 24, 2015 at 12:46 PM, Richard Eckart de Castilho <rec@apache.org
> wrote:

> On 23.07.2015, at 19:17, Joern Kottmann <ko...@gmail.com> wrote:
>
> >> If this is the scenario, another option would be to have the serialized
> CASes
> >> stored along with a reference to their type system, and have some new
> >> deserialization capability be able to locate the referred-to type
> system along
> >> with the CAS to be read in.  Would that "solve" this issue, or are
> there other
> >> aspects?
>
> https://issues.apache.org/jira/browse/UIMA-2127 ;)
>
> But having the TS stored alongside the CAS also is nice - see below.
>
> > It would probably solve it, but it is not a simple solution either. That
> > would mean that the Type System gets switched frequently and has to be
> > looked up all the time.
>
> For DKPro Core, I have implemented a BinaryCasWriter that stores the type
> system in the same file as the binary serialized CAS. It is not always the
> best solution because it adds a fixed overhead to every file, but it is
> very convenient. Optionally, the type system can be stored externally in a
> separate file to avoid this overhead. If and how this typesystem can be
> used depends on which of the six kinds of binary serialization is being
> used. See [1] for an overview over these formats and their properties.
>
>
We have a few hundred million documents in the system, storing the ts with
each document would be wasteful. It needs storage and it has to be parsed
for each CAS.



> In the BinaryCasReader, depending on the type of serialization, either:
> - there is a failure if the pipeline CAS typesystem is not compatible with
> the persisted CAS;
> - the type system in the pipeline CAS is reinitialized from the persisted
> CAS;
> - the data from the persisted CAS is loaded leniently, dropping all FSes
> that are not defined in the pipeline CAS typesystem
>
> Furthermore, the BinaryCasReader auto-detects the binary format and loads
> it, be it the Java serialization-based format or one of the binary formats
> that Marshall recently created, or our extended format that also embeds
> the typesystem in the file.
>
> Mind that depending on the use-case a different kind of serialization may
> be appropriate.
>
> For me, this covers in particular the following use-cases:
>
> - fast (de)serialization of the entire CAS
> - compact binary format (some more some less)
> - stable FS addresses (in some formats)
> - restoring the pipeline CAS type system from file (i.e. CAS can be
> initialized with an empty type system on creation and TS is set by reader -
> in some formats)
> - lenient loading of data allowing for different TSes on disk and in
> pipeline (in some formats)
>
> Would such an approach cover (some of your) use-cases?
>


With the current design the best option is probably to store a type system
id with the document.
It would be nice to avoid that additional complexity.
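
One way the repeated lookup could be kept cheap is to cache each type system
under its id, so that it is parsed once per id rather than once per CAS. The
sketch below is only an illustration: TypeSystemCacheSketch, the loader
function and the URI-style id are assumptions, and the type system itself is
left as a generic placeholder type.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache: one parsed type system per type system id. */
final class TypeSystemCacheSketch<T> {

    private final Map<URI, T> cache = new ConcurrentHashMap<>();
    private final Function<URI, T> loader;

    /** The loader does the expensive part, e.g. fetching and parsing. */
    TypeSystemCacheSketch(Function<URI, T> loader) {
        this.loader = loader;
    }

    /** Returns the cached type system, loading it on first use only. */
    T get(URI typeSystemId) {
        return cache.computeIfAbsent(typeSystemId, loader);
    }

    public static void main(String[] args) {
        // The String value stands in for a parsed type system.
        TypeSystemCacheSketch<String> cache =
                new TypeSystemCacheSketch<>(id -> "parsed type system for " + id);
        URI id = URI.create("urn:example:import-pipeline-a");
        System.out.println(cache.get(id));
        System.out.println(cache.get(id)); // served from the cache
    }
}
```

This only addresses the parsing cost; the id still has to be stored with
every document, which is the additional complexity mentioned above.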

I think I have mainly two cases I can't really deal with:
- A CAS contains FSes of many types. I know a few of those types and would
like to only work with them. Not interested at all in the FSes with other
types.
- A CAS contains FSes of many types. I just want to deal with them as if
they have a certain super-type. That could be FeatureStructure or
AnnotationFS.

The CASes above have been produced by many different AAEs with similar, but
slightly different type systems.

Jörn

Re: Ideas for UIMA v3

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 23.07.2015, at 19:17, Joern Kottmann <ko...@gmail.com> wrote:

>> If this is the scenario, another option would be to have the serialized CASes
>> stored along with a reference to their type system, and have some new
>> deserialization capability be able to locate the referred-to type system along
>> with the CAS to be read in.  Would that "solve" this issue, or are there other
>> aspects?

https://issues.apache.org/jira/browse/UIMA-2127 ;)

But having the TS stored alongside the CAS also is nice - see below.

> It would probably solve it, but it is not a simple solution either. That
> would mean that the Type System gets switched frequently and has to be
> looked up all the time.

For DKPro Core, I have implemented a BinaryCasWriter that stores the type system in the same file as the binary serialized CAS. It is not always the best solution because it adds a fixed overhead to every file, but it is very convenient. Optionally, the type system can be stored externally in a separate file to avoid this overhead. If and how this typesystem can be used depends on which of the six kinds of binary serialization is being used. See [1] for an overview over these formats and their properties.

In the BinaryCasReader, depending on the type of serialization, either:
- there is a failure if the pipeline CAS typesystem is not compatible with the persisted CAS;
- the type system in the pipeline CAS is reinitialized from the persisted CAS;
- the data from the persisted CAS is loaded leniently, dropping all FSes that are not defined in the pipeline CAS typesystem
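
The three behaviours listed above can be modelled in a few lines. This is a
toy model only, not the DKPro Core implementation: type systems are reduced to
sets of type names, FSes to their type names, and "compatible" is assumed to
mean that every persisted type is also defined in the pipeline CAS. All names
in the example are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of strict, reinitializing and lenient loading. */
final class LoadModeSketch {

    enum Mode { STRICT, REINIT, LENIENT }

    /** Outcome: the types the pipeline CAS ends up with and the FSes kept. */
    static final class Loaded {
        final Set<String> types;
        final List<String> fsTypes;

        Loaded(Set<String> types, List<String> fsTypes) {
            this.types = types;
            this.fsTypes = fsTypes;
        }
    }

    static Loaded load(Mode mode, Set<String> pipelineTypes,
                       Set<String> persistedTypes, List<String> persistedFs) {
        switch (mode) {
            case STRICT:
                // Fail when the persisted CAS uses types the pipeline lacks.
                if (!pipelineTypes.containsAll(persistedTypes)) {
                    throw new IllegalStateException("incompatible type system");
                }
                return new Loaded(pipelineTypes, persistedFs);
            case REINIT:
                // Take over the persisted type system; everything is loaded.
                return new Loaded(persistedTypes, persistedFs);
            case LENIENT: {
                // Keep the pipeline type system; drop FSes of unknown types.
                List<String> kept = new ArrayList<>();
                for (String type : persistedFs) {
                    if (pipelineTypes.contains(type)) {
                        kept.add(type);
                    }
                }
                return new Loaded(pipelineTypes, kept);
            }
            default:
                throw new AssertionError(mode);
        }
    }

    public static void main(String[] args) {
        Set<String> pipeline = new HashSet<>(Arrays.asList("Token"));
        Set<String> persisted = new HashSet<>(Arrays.asList("Token", "PdfPage"));
        List<String> fs = Arrays.asList("Token", "PdfPage", "Token");
        System.out.println(load(Mode.LENIENT, pipeline, persisted, fs).fsTypes);
    }
}
```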

Furthermore, the BinaryCasReader auto-detects the binary format and loads it, be it the Java serialization-based format or one of the binary formats that Marshall recently created, or our extended format that also embeds the typesystem in the file.

Mind that depending on the use-case a different kind of serialization may be appropriate.

For me, this covers in particular the following use-cases:

- fast (de)serialization of the entire CAS
- compact binary format (some more some less)
- stable FS addresses (in some formats)
- restoring the pipeline CAS type system from file (i.e. CAS can be initialized with an empty type system on creation and TS is set by reader - in some formats)
- lenient loading of data allowing for different TSes on disk and in pipeline (in some formats)

Would such an approach cover (some of your) use-cases? 

Cheers,

-- Richard

[1] http://www.dkpro.org/dkpro-core/releases/1.7.0/apidocs/index.html?de/tudarmstadt/ukp/dkpro/core/io/bincas/BinaryCasWriter.html


Re: Ideas for UIMA v3

Posted by Joern Kottmann <ko...@gmail.com>.
On Thu, 2015-07-23 at 11:01 -0400, Marshall Schor wrote:
> Hi Jörn,
> 
> Thank you for your comments; I hope you can expand a bit (see below).
> 
> On 7/23/2015 9:45 AM, Joern Kottmann wrote:
> > Well, I thought about something which can be done in  3, 4 or 5 lines of
> > code.
> >
> > To use a CAS, you first create the TypeSystemDescriptor, then create an
> > empty CAS, and then load something into it.
> > Placing content in it is often done using an AE. If I want to reuse an
> > existing deserializer/serializer I always end up with an AE,
> > maybe there are some rare exceptions.
> >
> > In a bigger system there will be a couple of components dealing with CASes,
> > if there is a small change to the type system they all have to be updated,
> > even when they are not affected by the change, e.g. type addition or a
> > change to a type they don't use. 
> I'd like to understand this better.  Since the pipeline's final type system is
> created at pipeline-startup-time, from the "merge" of all the component's type
> systems, it seems to me that you would not need to update the type systems in
> other components not affected by the change?

I was not referring to UIMA components here.

Imagine a system that uses multiple AAEs to analyze some documents. The
documents might have really different types. The AAEs do a great job
adapting to the document types by using the right AEs to deal with the
content. An AE added to an AAE can also introduce new types. These new
types are merged into the type system of that particular AAE. The FS
added to the CAS having those types might be interesting later when
viewing that document, or for things that are specific to that
particular document, but not be important across the entire document
collection.

All the CASes output by these different AAEs have to be further
processed. And that is where things get tricky. The component dealing
with them is probably again very specific and might only want to look at
a few FeatureStructure types, or maybe at all of them.

How can we write a MapReduce job that processes all CASes (with slightly
different but not incompatible type systems) in a database? Maybe
something as simple as counting all Email Address annotations in all
those CASes.
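
The counting job itself is small once the annotations are reachable. A minimal
sketch, assuming each document has already been reduced to a list of
annotation type names (a hypothetical stand-in for a deserialized CAS); types
the job does not know are simply skipped, so no merged type system is needed
for this particular job.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical map/reduce-style count over documents with differing types. */
final class TypeCountSketch {

    /** "Map" step: counts annotations of one type in a single document. */
    static long countInDocument(List<String> annotationTypes, String wanted) {
        return annotationTypes.stream().filter(wanted::equals).count();
    }

    /** "Reduce" step: sums the per-document counts over the collection. */
    static long countInCollection(List<List<String>> documents, String wanted) {
        return documents.stream()
                .mapToLong(doc -> countInDocument(doc, wanted))
                .sum();
    }

    public static void main(String[] args) {
        List<List<String>> documents = Arrays.asList(
                Arrays.asList("EmailAddress", "Token", "EmailAddress"),
                Arrays.asList("Token", "PdfPageBreak"),
                Arrays.asList("EmailAddress"));
        System.out.println(countInCollection(documents, "EmailAddress"));
    }
}
```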

To be able to load that content into a CAS we either have to swap the
type system per CAS type (not nice) or just merge all existing type
systems together.

If the type system in one of my AAEs now changes, e.g. by a type addition,
the mapred job also has to be updated with the new type system, even
though it might never deal with that type.

Ok, maybe we can somehow solve that by stripping the unknown types from
the CASes.

> If the concern is the need to have a JCas cover class generated for the merged
> type system, version 3 is hoping to make that "automatic".
> > In our system we have many different
> > import pipelines, sometimes those pipelines have specific types which are
> > only used in an early stage, if a generic component has to deal with one of
> > those CASes the only good option is to merge all type systems together.
> Since UIMA pipelines do this type merge, I'm guessing you might be thinking
> about this outside of UIMA pipelines, such as a scenario where you have one step
> (using those many different import pipelines), and perhaps having those write
> out some CASs, and then wanting to read in those CASes in another step to be
> processed by your generic component, and therefore needing that 2nd step to have
> the merge of all the type systems together, to enable deserializing.  Is this
> the scenario, or is there another use case you're thinking of?

Yes, but that generic component might have different requirements, maybe
it just deals with a few types it knows very well, or it can deal with
all types.

> If this is the scenario, another option would be to have the serialized CASes
> stored along with a reference to their type system, and have some new
> deserialization capability be able to locate the referred-to type system along
> with the CAS to be read in.  Would that "solve" this issue, or are there other
> aspects?


It would probably solve it, but it is not a simple solution either. That
would mean that the Type System gets switched frequently and has to be
looked up all the time.

There is a CAS. It may or may not contain FSes of a certain
type. The type is always known by the code dealing with the CAS.

Why do I first have to load the right type system to retrieve those
FSes?

With my CAS-like thing I can just write cas.getIndex("index37",
EmailAddressAnnotation.class) and it just returns them as Java objects
of type EmailAddressAnnotation. 

There is no type system, and it doesn't work well with types I don't
know anything about, but at that place I am also not interested in
those.

In a different place in the system the code might only assume it's an
annotation and retrieve the same index as objects of type AnnotationFS:
cas.getIndex("index37", AnnotationFS.class)
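
A minimal sketch of what such a lookup could look like, written here only to
illustrate the idea. CasLikeSketch and the annotation classes are hypothetical
stand-ins defined in the example itself (the local AnnotationFS interface is
not UIMA's); this is not the actual CAS-like layer described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical CAS-like container that works without a type system. */
final class CasLikeSketch {

    /** Local stand-in for "anything with offsets"; not UIMA's AnnotationFS. */
    interface AnnotationFS {
        int getBegin();
        int getEnd();
    }

    /** Shared base so the example classes stay short. */
    static class BaseAnnotation implements AnnotationFS {
        private final int begin;
        private final int end;

        BaseAnnotation(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override public int getBegin() { return begin; }
        @Override public int getEnd() { return end; }
    }

    static final class EmailAddressAnnotation extends BaseAnnotation {
        EmailAddressAnnotation(int begin, int end) { super(begin, end); }
    }

    static final class PersonAnnotation extends BaseAnnotation {
        PersonAnnotation(int begin, int end) { super(begin, end); }
    }

    private final Map<String, List<Object>> indexes = new HashMap<>();

    /** Adds an object to the named index, creating the index if needed. */
    void add(String indexName, Object fs) {
        indexes.computeIfAbsent(indexName, k -> new ArrayList<>()).add(fs);
    }

    /** Returns the entries of the index that are instances of the given class. */
    <T> List<T> getIndex(String indexName, Class<T> type) {
        List<T> out = new ArrayList<>();
        for (Object fs : indexes.getOrDefault(indexName, Collections.emptyList())) {
            if (type.isInstance(fs)) {
                out.add(type.cast(fs));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        CasLikeSketch cas = new CasLikeSketch();
        cas.add("index37", new EmailAddressAnnotation(0, 15));
        cas.add("index37", new PersonAnnotation(20, 24));
        System.out.println(cas.getIndex("index37", EmailAddressAnnotation.class).size());
        System.out.println(cas.getIndex("index37", AnnotationFS.class).size());
    }
}
```

The class passed in acts as the filter: objects of classes the caller does
not know about are skipped rather than causing a failure.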

UIMA doesn't make it easy with its static type system to write generic
code working with CASes.

> >
> > The way we use UIMA is that we let it process our content with different
> > custom pipelines, and at the end of each pipeline the results are converted
> > into POJOs and those are written into a database, all code which follows
> > just uses the POJOs to process the data. My point is: If the CAS would be
> > in a better state we could just use it throughout the entire application
> > instead of our CAS-like layer.
> 
> In version 3, we're planning on storing the Feature Structures as just instances
> of their JCas Java Cover Objects, pretty close to POJOs. So maybe there's a good
> chance...

Do we just use POJOs or are they again generated from a type system?

Jörn

Re: Ideas for UIMA v3

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 23.07.2015, at 17:01, Marshall Schor <ms...@schor.com> wrote:

>> The way we use UIMA is that we let it process our content with different
>> custom pipelines, and at the end of each pipeline the results are converted
>> into POJOs and those are written into a database, all code which follows
>> just uses the POJOs to process the data. My point is: If the CAS would be
>> in a better state we could just use it throughout the entire application
>> instead of our CAS-like layer.
> 
> In version 3, we're planning on storing the Feature Structures as just instances
> of their JCas Java Cover Objects, pretty close to POJOs. So maybe there's a good
> chance...

You mean as "instances of Java Cover Objects" without "JCas", right? 

I mean there should still be the possibility to work with the CAS without JCas objects, basically via the CAS interface, no? We shouldn't have to use some Java reflection and bean utils to work with JCas - and there was also one wish for a more schemaless CAS which contradicts only having a strongly typed JCas API.

Cheers,

-- Richard

Re: Ideas for UIMA v3

Posted by Marshall Schor <ms...@schor.com>.
Hi Jörn,

Thank you for your comments; I hope you can expand a bit (see below).

On 7/23/2015 9:45 AM, Joern Kottmann wrote:
> Well, I thought about something which can be done in  3, 4 or 5 lines of
> code.
>
> To use a CAS, you first create the TypeSystemDescriptor, then create an
> empty CAS, and then load something into it.
> Placing content in it is often done using an AE. If I want to reuse an
> existing deserializer/serializer I always end up with an AE,
> maybe there are some rare exceptions.
>
> In a bigger system there will be a couple of components dealing with CASes,
> if there is a small change to the type system they all have to be updated,
> even when they are not affected by the change, e.g. type addition or a
> change to a type they don't use. 
I'd like to understand this better.  Since the pipeline's final type system is
created at pipeline-startup-time, from the "merge" of all the component's type
systems, it seems to me that you would not need to update the type systems in
other components not affected by the change?

If the concern is the need to have a JCas cover class generated for the merged
type system, version 3 is hoping to make that "automatic".
> In our system we have many different
> import pipelines, sometimes those pipelines have specific types which are
> only used in an early stage, if a generic component has to deal with one of
> those CASes the only good option is to merge all type systems together.
Since UIMA pipelines do this type merge, I'm guessing you might be thinking
about this outside of UIMA pipelines, such as a scenario where you have one step
(using those many different import pipelines), and perhaps having those write
out some CASs, and then wanting to read in those CASes in another step to be
processed by your generic component, and therefore needing that 2nd step to have
the merge of all the type systems together, to enable deserializing.  Is this
the scenario, or is there another use case you're thinking of?

If this is the scenario, another option would be to have the serialized CASes
stored along with a reference to their type system, and have some new
deserialization capability be able to locate the referred-to type system along
with the CAS to be read in.  Would that "solve" this issue, or are there other
aspects?

>
> The way we use UIMA is that we let it process our content with different
> custom pipelines, and at the end of each pipeline the results are converted
> into POJOs and those are written into a database, all code which follows
> just uses the POJOs to process the data. My point is: If the CAS would be
> in a better state we could just use it throughout the entire application
> instead of our CAS-like layer.

In version 3, we're planning on storing the Feature Structures as just instances
of their JCas Java Cover Objects, pretty close to POJOs. So maybe there's a good
chance...

-Marshall

>
> Jörn
>
> On Thu, Jul 23, 2015 at 2:55 PM, Richard Eckart de Castilho <re...@apache.org>
> wrote:
>
>> On 23.07.2015, at 14:43, Joern Kottmann <ko...@gmail.com> wrote:
>>
>>> One thing which must have been overlooked when UIMA was built is that
>>> people (like me) have to write code which wants to interact with the CAS
>>> but can't be an AE. In UIMA the CAS (either in memory, or serialized)
>>> is difficult to
>>> be used without implementing an AE.
>> I'm not sure why you feel like that. E.g. in WebAnno (an annotation editor
>> that uses the CAS as its internal data model), we operate with the CAS
>> basically without any AEs. All editing operations are done directly on the
>> CAS which is loaded/saved directly using the UIMA API calls for binary
>> serialization.
>>
>> Basically, we are using the same API that we would be using in an AE, but
>> without
>> the AE/pipelining stuff. It doesn't get any more difficult without the AE
>> - in fact
>> some things become easier without AEs, readers, and consumers.
>>
>> I'm sure you must have something similar in the CAS Editor plugin in
>> Eclipse, no?
>>
>> -- Richard


Re: Ideas for UIMA v3

Posted by Joern Kottmann <ko...@gmail.com>.
Well, I thought about something which can be done in  3, 4 or 5 lines of
code.

To use a CAS, you first create the TypeSystemDescriptor, then create an
empty CAS, and then load something into it.
Placing content in it is often done using an AE. If I want to reuse an
existing deserializer/serializer I always end up with an AE,
maybe there are some rare exceptions.

In a bigger system there will be a couple of components dealing with CASes,
if there is a small change to the type system they all have to be updated,
even when they are not affected by the change, e.g. type addition or a
change to a type they don't use. In our system we have many different
import pipelines, sometimes those pipelines have specific types which are
only used in an early stage, if a generic component has to deal with one of
those CASes the only good option is to merge all type systems together.

The way we use UIMA is that we let it process our content with different
custom pipelines, and at the end of each pipeline the results are converted
into POJOs and those are written into a database, all code which follows
just uses the POJOs to process the data. My point is: If the CAS would be
in a better state we could just use it throughout the entire application
instead of our CAS-like layer.

Jörn

On Thu, Jul 23, 2015 at 2:55 PM, Richard Eckart de Castilho <re...@apache.org>
wrote:

> On 23.07.2015, at 14:43, Joern Kottmann <ko...@gmail.com> wrote:
>
> > One thing which must have been overlooked when UIMA was built is that
> > people (like me) have to write code which wants to interact with the CAS
> > but can't be an AE. In UIMA the CAS (either in memory, or serialized)
> > is difficult to
> > be used without implementing an AE.
>
> I'm not sure why you feel like that. E.g. in WebAnno (an annotation editor
> that uses the CAS as its internal data model), we operate with the CAS
> basically without any AEs. All editing operations are done directly on the
> CAS which is loaded/saved directly using the UIMA API calls for binary
> serialization.
>
> Basically, we are using the same API that we would be using in an AE, but
> without
> the AE/pipelining stuff. It doesn't get any more difficult without the AE
> - in fact
> some things become easier without AEs, readers, and consumers.
>
> I'm sure you must have something similar in the CAS Editor plugin in
> Eclipse, no?
>
> -- Richard

Re: Ideas for UIMA v3

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 23.07.2015, at 14:43, Joern Kottmann <ko...@gmail.com> wrote:

> One thing which must have been overlooked when UIMA was built is that
> people (like me) have to write code which wants to interact with the CAS
> but can't be an AE. In UIMA the CAS (either in memory, or serialized)
> is difficult to
> be used without implementing an AE.

I'm not sure why you feel like that. E.g. in WebAnno (an annotation editor
that uses the CAS as its internal data model), we operate with the CAS
basically without any AEs. All editing operations are done directly on the
CAS which is loaded/saved directly using the UIMA API calls for binary 
serialization.

Basically, we are using the same API that we would be using in an AE, but without
the AE/pipelining stuff. It doesn't get any more difficult without the AE - in fact
some things become easier without AEs, readers, and consumers. 

I'm sure you must have something similar in the CAS Editor plugin in Eclipse, no?

-- Richard