Posted to user@uima.apache.org by Luca Foppiano <lu...@foppiano.org> on 2014/01/23 09:21:53 UTC

uima-fit and uima annotators (in my case Whitespace annotator)

Hi Everybody,
    I'm starting to play with uima-fit and I'm trying to integrate the
whitespace annotator into my simple pipeline, which is composed of a
collection reader and a simple AE (it plays with the text, doesn't
annotate). I want to add a whitespace annotator to be applied to the text.

I've downloaded the trunk version of the Whitespace annotator from github, I've
extracted the type system definition from the descriptor XML and referenced
it from uimafit. The pipeline worked without crashing.

Now I want to add an AE that takes the annotations and does something with
them (print them, for example).

I could not find a way to work around the fact that the type system java
classes were not present in the project. Is this a mandatory requirement?

What I've tried is to do something like:

// Get the autogenerated type system types (SentenceAnnotation, TokenAnnotation)
TypeDescription[] types = tsd.getTypes();

[...]
// ...and try to pass them to my annotator
AnalysisEngineDescription casConsumer =
        AnalysisEngineFactory.createEngineDescription(SimpleCC.class,
                SimpleCC.OUTPUT_DIR_PARAM,
                "/home/lf84914/development/epo/apl/data/out",
                types, null);

but then, in the AE's code, I have no idea how to use them.

Any suggestions?

Thanks to everybody in advance.
-- 
Luca Foppiano

Software Engineer
+31615253280
luca@foppiano.org
www.foppiano.org

Re: uima-fit and uima annotators (in my case Whitespace annotator)

Posted by Luca Foppiano <lu...@foppiano.org>.

On Thu, Jan 23, 2014 at 6:18 PM, Richard Eckart de Castilho
<re...@apache.org> wrote:

> Thanks. Here are some more specific tips:
>

[...]

Hi Richard,
     Thanks a lot for your detailed explanation. I'll try it out by the end
of this week and I will add it to the documentation.

Cheers
-- 
Luca Foppiano

Software Engineer
+31615253280
luca@foppiano.org
www.foppiano.org

Re: uima-fit and uima annotators (in my case Whitespace annotator)

Posted by Richard Eckart de Castilho <re...@apache.org>.

I opened an issue for that and will try to look into it before the
next uimaFIT release.

https://issues.apache.org/jira/browse/UIMA-3591

Thanks for the feedback!

-- Richard

On 29.01.2014, at 14:06, Luca Foppiano <lu...@foppiano.org> wrote:

> Hi Richard,
>  here is some feedback:
> 
> On Thu, Jan 23, 2014 at 6:18 PM, Richard Eckart de Castilho
> <re...@apache.org> wrote:
> 
>> Thanks. Here are some more specific tips:
>> 
>> 
> [...]
> 
> 
>> uimaFIT should be able to automatically coerce single values into
>> multi-valued parameters. So it should be possible to write this
>> 
>> AnalysisEngineFactory.createEngineDescription(WhitespaceTokenizer.class,
>>                "SofaNames", SimpleParserAE.SOFA_NAME_TEXT_ONLY);
>> 
>> 
> This is not working, I got a ClassCastException
> 
> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to [Ljava.lang.String;
>    at org.apache.uima.annotator.WhitespaceTokenizer.initialize(WhitespaceTokenizer.java:328)
>    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:252)
> 
> In principle I don't have control over that annotator, so if it does
> something unusual during initialization, I'm not able to change it. If I
> leave the String[], it works without problems.
> 
> Thanks
> 
> -- 
> Luca Foppiano
> 
> Software Engineer
> +31615253280
> luca@foppiano.org
> www.foppiano.org

Re: uima-fit and uima annotators (in my case Whitespace annotator)

Posted by Luca Foppiano <lu...@foppiano.org>.

Hi Richard,
  here is some feedback:

On Thu, Jan 23, 2014 at 6:18 PM, Richard Eckart de Castilho
<re...@apache.org> wrote:

> Thanks. Here are some more specific tips:
>
>
[...]


> uimaFIT should be able to automatically coerce single values into
> multi-valued parameters. So it should be possible to write this
>
> AnalysisEngineFactory.createEngineDescription(WhitespaceTokenizer.class,
>                 "SofaNames", SimpleParserAE.SOFA_NAME_TEXT_ONLY);
>
>
This is not working, I got a ClassCastException

Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to [Ljava.lang.String;
    at org.apache.uima.annotator.WhitespaceTokenizer.initialize(WhitespaceTokenizer.java:328)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:252)

In principle I don't have control over that annotator, so if it does
something unusual during initialization, I'm not able to change it. If I
leave the String[], it works without problems.
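
The stack trace is consistent with the parameter value being cast straight
to String[] during initialization. A minimal plain-Java sketch of that failure
mode and of a defensive coercion follows. The names SofaNamesCoercion and
toStringArray, and the placeholder value "textOnly", are hypothetical; this is
not code from WhitespaceTokenizer or uimaFIT.

```java
import java.util.Arrays;

// Hypothetical illustration; not code from WhitespaceTokenizer or uimaFIT.
final class SofaNamesCoercion {

    // Accepts either a single String or a String[] and always returns a String[].
    static String[] toStringArray(Object value) {
        if (value == null) {
            return new String[0];
        }
        if (value instanceof String[]) {
            return (String[]) value;
        }
        if (value instanceof String) {
            return new String[] { (String) value };
        }
        throw new IllegalArgumentException(
                "Expected String or String[], got " + value.getClass().getName());
    }

    public static void main(String[] args) {
        Object single = "textOnly"; // placeholder sofa name

        try {
            // A direct cast like this fails when the value is a single String.
            String[] names = (String[]) single;
            System.out.println(names.length);
        } catch (ClassCastException e) {
            System.out.println("Direct cast failed: " + e.getMessage());
        }

        // Coercing first works for both shapes of input.
        System.out.println(Arrays.toString(toStringArray(single)));
        System.out.println(Arrays.toString(toStringArray(new String[] { "a", "b" })));
    }
}
```

When the component itself cannot be changed, passing an explicit String[]
from the calling side sidesteps the direct cast.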

Thanks

-- 
Luca Foppiano

Software Engineer
+31615253280
luca@foppiano.org
www.foppiano.org

Re: uima-fit and uima annotators (in my case Whitespace annotator)

Posted by Richard Eckart de Castilho <re...@apache.org>.

Thanks. Here are some more specific tips:

You can specify all engines in the call to runPipeline - no need for the AggregateBuilder
unless you need to do sofa mappings.

SimplePipeline.runPipeline(reader, preparationEngine, whitespaceEngine, casConsumer);

Parameter constants typically begin with "PARAM_" instead of ending in "_PARAM". That makes a difference if you ever plan to use the uimafit-maven-plugin to automatically generate descriptors from your AEs, because it uses prefixes to detect parameter name constants.
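
To illustrate why the prefix matters, here is a minimal sketch of prefix-based detection using plain Java reflection. It is not the uimafit-maven-plugin's actual implementation, and the names ParamConstantScan, PrefixStyle and SuffixStyle are made up for this example.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; this is not the uimafit-maven-plugin implementation.
final class ParamConstantScan {

    // Example component following the PARAM_ prefix convention.
    static final class PrefixStyle {
        public static final String PARAM_OUTPUT_DIR = "outputDir";
    }

    // Example component using the _PARAM suffix instead.
    static final class SuffixStyle {
        public static final String OUTPUT_DIR_PARAM = "outputDir";
    }

    // Collects the names of public static final String fields that start with the prefix.
    static List<String> findParamConstants(Class<?> type, String prefix) {
        List<String> names = new ArrayList<>();
        for (Field field : type.getDeclaredFields()) {
            int mods = field.getModifiers();
            boolean constant = Modifier.isPublic(mods)
                    && Modifier.isStatic(mods)
                    && Modifier.isFinal(mods);
            if (constant && field.getType() == String.class
                    && field.getName().startsWith(prefix)) {
                names.add(field.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // The prefix-style constant is found; the suffix-style one is not.
        System.out.println(findParamConstants(PrefixStyle.class, "PARAM_"));
        System.out.println(findParamConstants(SuffixStyle.class, "PARAM_"));
    }
}
```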

uimaFIT should be able to automatically coerce single values into multi-valued parameters. So it should be possible to write this

AnalysisEngineFactory.createEngineDescription(WhitespaceTokenizer.class,
                "SofaNames", SimpleParserAE.SOFA_NAME_TEXT_ONLY);

Cheers,

-- Richard

On 23.01.2014, at 14:45, Luca Foppiano <lu...@foppiano.org> wrote:

> On Thu, Jan 23, 2014 at 3:13 PM, Richard Eckart de Castilho
> <re...@apache.org> wrote:
> 
>> Hi,
>> 
> Hi Richard,
> 
> 
>> can you provide the full code for your sample pipeline? I think that would
>> make it easier to help.
>> 
> 
> Sure, it is located here: https://github.com/lfoppiano/uima-fit-sample-pipeline
> 
> 
>> With the present information, I can only give some general advice.
>> 
>> [...]
> 
>> 
>> I would recommend using the CAS/CasUtil only if you want to implement a
>> generic component that can be configured to work with different types. If
>> your component is fixed to a certain type system, then using the
>> JCas/JCasUtil is much more convenient.
>> 
> 
> Thanks a lot for your input, in fact it shed some light on type
> systems.
> 
> Regards
> -- 
> Luca Foppiano
> 
> Software Engineer
> +31615253280
> luca@foppiano.org
> www.foppiano.org

Re: uima-fit and uima annotators (in my case Whitespace annotator)

Posted by Luca Foppiano <lu...@foppiano.org>.

On Thu, Jan 23, 2014 at 3:13 PM, Richard Eckart de Castilho
<re...@apache.org> wrote:

> Hi,
>

Hi Richard,


> can you provide the full code for your sample pipeline? I think that would
> make it easier to help.
>

Sure, it is located here: https://github.com/lfoppiano/uima-fit-sample-pipeline


> With the present information, I can only give some general advice.
>
> [...]

>
> I would recommend using the CAS/CasUtil only if you want to implement a
> generic component that can be configured to work with different types. If
> your component is fixed to a certain type system, then using the
> JCas/JCasUtil is much more convenient.
>

Thanks a lot for your input, in fact it shed some light on type
systems.

Regards
-- 
Luca Foppiano

Software Engineer
+31615253280
luca@foppiano.org
www.foppiano.org

Re: uima-fit and uima annotators (in my case Whitespace annotator)

Posted by Richard Eckart de Castilho <re...@apache.org>.

See comments inline. I've removed those parts that do not seem to require
further discussion.

On 29.01.2014, at 17:32, Luca Foppiano <lu...@foppiano.org> wrote:
>> - the type systems of all components in a pipeline are automatically merged
>> when a pipeline is run (e.g. using SimplePipeline.runPipeline). Thus, it
>> would also work to pass a TSD with all types used in the pipeline only to
>> the reader, but not to any of the subsequent components.
>> 
> 
> Ok, that's an important point in fact.
> Do you know if the order (whether it is passed to the first or last
> component) matters?

It does not matter. All type information from all components is merged
together and used to initialize the CASes which are passed through all
of the components.
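
As a toy illustration of why the order does not matter: if merging is modeled
as a per-type union of declared features, the result is the same regardless
of which component's declaration is processed first. This sketch is plain Java
and does not use UIMA's actual merge code, which is more involved; the class
TypeMergeSketch and the type and feature names are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only; UIMA's real type system merge is more involved.
final class TypeMergeSketch {

    // Builds a declaration of one type (type name -> feature names).
    static Map<String, Set<String>> declare(String type, String... features) {
        Map<String, Set<String>> declaration = new TreeMap<>();
        declaration.put(type, new TreeSet<>(Arrays.asList(features)));
        return declaration;
    }

    // Merges declarations by taking the union of features per type name.
    @SafeVarargs
    static Map<String, Set<String>> merge(Map<String, Set<String>>... declarations) {
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Map<String, Set<String>> declaration : declarations) {
            for (Map.Entry<String, Set<String>> entry : declaration.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), k -> new TreeSet<>())
                        .addAll(entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Placeholder type and feature names.
        Map<String, Set<String>> fromReader = declare("example.Token", "begin", "end");
        Map<String, Set<String>> fromConsumer = declare("example.Sentence", "begin", "end");

        // Compare the merge result for both processing orders.
        System.out.println(merge(fromReader, fromConsumer).equals(merge(fromConsumer, fromReader)));
    }
}
```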

>> - alternatively, it is possible to have uimaFIT automatically detect your
>> types [1]. If you do that, there is no need at all to pass the TSD to the
>> component - it happens automatically.
>> 
>>  createEngineDescription(SimpleCC.class,
>>    SimpleCC.PARAM_OUTPUT_DIR, "…");
>> 
> 
> OK. Do you have an example/use case of when the TSD should be passed to the
> engine? Perhaps when the type system is loaded by manually fetching the
> information or reading the descriptor programmatically?

It needs to be passed when you do not make use of uimaFIT's feature for
auto-detecting the type descriptors.

You may want to do this when you need extra control over the types you
pass in, e.g.:

- when you want to pass only a subset of the types that would be auto-discovered
- when you programmatically generate or modify a type system
- when there is a reason that you cannot use auto-discovery (possibly in OSGi environments)

>> - if you want to retrieve annotations from the CAS without using the JCas
>> wrappers, you can have a look at the CasUtil class. E.g.
>> 
>>  CasUtil.select(cas, CasUtil.getType(cas, "my.package.name.MyType"))
>> 
>> Mind, this call works only if "MyType" inherits from the built-in
>> "Annotation" type. Otherwise, you would use "selectFS" instead of "select".
>> 
>> I would recommend using the CAS/CasUtil only if you want to implement a
>> generic component that can be configured to work with different types. If
>> your component is fixed to a certain type system, then using the
>> JCas/JCasUtil is much more convenient.
> 
> OK, that's definitely helpful, but I still have a bit of confusion in my
> head between JCas and CAS.
> 
> In my example I could use JCas; the problem is that the JCasUtil.select()
> method requires the type as a Java class
> 
> [...] select([...],  *final Class<T> type*) [...]
> 
> while the CAS/CasUtil select() method takes the type as a Type. Is
> there a reason for this difference? I might have missed/forgotten something
> or some part of the documentation.

The JCas maps the UIMA type system to the Java type system. The CAS is one level
below that. Some people who first learned JCas later try to use Java reflection
to dynamically create an annotation of a certain type based on a type name passed
to the component as a parameter. If you ever think about using reflection on
JCas types, you should use the CAS interface instead.

There may be reasons like performance or memory usage that favor one interface
over the other - however, I personally have not done any extensive evaluation of
that. I generally favor convenient programming over premature optimization. So
far, JCas vs. CAS hasn't been a problem for me.

Cheers,

-- Richard

Re: uima-fit and uima annotators (in my case Whitespace annotator)

Posted by Luca Foppiano <lu...@foppiano.org>.

On Thu, Jan 23, 2014 at 3:13 PM, Richard Eckart de Castilho
<re...@apache.org> wrote:

> Hi,
>

Hi again,
  see my feedback below on the points you made previously.
[...]

> - it is not mandatory to have the type system java classes (JCas wrappers)
> present in a project if none of your components (Readers, AEs, CCs) use
> them.
>

Indeed, in my sample I want to take a 3rd-party annotator and experiment
with the integration.


> - it is possible to manually load a type system description (TSD) and pass
> it to the components. But then the TSD is the second argument to the
> createXXXDescription call, e.g.
>
>   createEngineDescription(SimpleCC.class, tsd,
>     SimpleCC.PARAM_OUTPUT_DIR, "…");
>


> - the type systems of all components in a pipeline are automatically merged
> when a pipeline is run (e.g. using SimplePipeline.runPipeline). Thus, it
> would also work to pass a TSD with all types used in the pipeline only to
> the reader, but not to any of the subsequent components.
>

Ok, that's an important point in fact.
Do you know if the order (whether it is passed to the first or last
component) matters?


> - alternatively, it is possible to have uimaFIT automatically detect your
> types [1]. If you do that, there is no need at all to pass the TSD to the
> component - it happens automatically.
>
>   createEngineDescription(SimpleCC.class,
>     SimpleCC.PARAM_OUTPUT_DIR, "…");
>

OK. Do you have an example/use case of when the TSD should be passed to the
engine? Perhaps when the type system is loaded by manually fetching the
information or reading the descriptor programmatically?



> - if you want to retrieve annotations from the CAS without using the JCas
> wrappers, you can have a look at the CasUtil class. E.g.
>
>   CasUtil.select(cas, CasUtil.getType(cas, "my.package.name.MyType"))
>
> Mind, this call works only if "MyType" inherits from the built-in
> "Annotation" type. Otherwise, you would use "selectFS" instead of "select".
>
> I would recommend using the CAS/CasUtil only if you want to implement a
> generic component that can be configured to work with different types. If
> your component is fixed to a certain type system, then using the
> JCas/JCasUtil is much more convenient.
>

OK, that's definitely helpful, but I still have a bit of confusion in my
head between JCas and CAS.

In my example I could use JCas; the problem is that the JCasUtil.select()
method requires the type as a Java class

[...] select([...],  *final Class<T> type*) [...]

while the CAS/CasUtil select() method takes the type as a Type. Is
there a reason for this difference? I might have missed/forgotten something
or some part of the documentation.

Thank you again
-- 
Luca Foppiano

Software Engineer
+31615253280
luca@foppiano.org
www.foppiano.org

Re: uima-fit and uima annotators (in my case Whitespace annotator)

Posted by Richard Eckart de Castilho <re...@apache.org>.

Hi,

can you provide the full code for your sample pipeline? I think that would make it easier to help.

With the present information, I can only give some general advice.

- it is not mandatory to have the type system java classes (JCas wrappers) present in a project if none of your components (Readers, AEs, CCs) use them.

- it is possible to manually load a type system description (TSD) and pass it to the components. But then the TSD is the second argument to the createXXXDescription call, e.g.

  createEngineDescription(SimpleCC.class, tsd, 
    SimpleCC.PARAM_OUTPUT_DIR, "…");

- the type systems of all components in a pipeline are automatically merged when a pipeline is run (e.g. using SimplePipeline.runPipeline). Thus, it would also work to pass a TSD with all types used in the pipeline only to the reader, but not to any of the subsequent components.

- alternatively, it is possible to have uimaFIT automatically detect your types [1]. If you do that, there is no need at all to pass the TSD to the component - it happens automatically.

  createEngineDescription(SimpleCC.class,
    SimpleCC.PARAM_OUTPUT_DIR, "…");

- if you want to retrieve annotations from the CAS without using the JCas wrappers, you can have a look at the CasUtil class. E.g.

  CasUtil.select(cas, CasUtil.getType(cas, "my.package.name.MyType"))

Mind, this call works only if "MyType" inherits from the built-in "Annotation" type. Otherwise, you would use "selectFS" instead of "select".

I would recommend using the CAS/CasUtil only if you want to implement a generic component that can be configured to work with different types. If your component is fixed to a certain type system, then using the JCas/JCasUtil is much more convenient.

-- Richard

[1] http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#ugr.tools.uimafit.typesystem


On 23.01.2014, at 06:21, Luca Foppiano <lu...@foppiano.org> wrote:

> Hi Everybody,
>    I'm starting to play with uima-fit and I'm trying to integrate the
> whitespace annotator into my simple pipeline, which is composed of a
> collection reader and a simple AE (it plays with the text, doesn't
> annotate). I want to add a whitespace annotator to be applied to the text.
> 
> I've downloaded the trunk version of the Whitespace annotator from github, I've
> extracted the type system definition from the descriptor XML and referenced
> it from uimafit. The pipeline worked without crashing.
> 
> Now I want to add an AE that takes the annotations and does something with
> them (print them, for example).
> 
> I could not find a way to work around the fact that the type system java
> classes were not present in the project. Is this a mandatory requirement?
> 
> What I've tried is to do something like:
> 
> // Get the autogenerated type system types (SentenceAnnotation, TokenAnnotation)
> TypeDescription[] types = tsd.getTypes();
> 
> [...]
> // ...and try to pass them to my annotator
> AnalysisEngineDescription casConsumer =
>         AnalysisEngineFactory.createEngineDescription(SimpleCC.class,
>                 SimpleCC.OUTPUT_DIR_PARAM,
>                 "/home/lf84914/development/epo/apl/data/out",
>                 types, null);
> 
> but then, in the AE's code, I have no idea how to use them.
> 
> Any suggestions?
> 
> Thanks to everybody in advance.
> -- 
> Luca Foppiano
> 
> Software Engineer
> +31615253280
> luca@foppiano.org
> www.foppiano.org