Posted to user@uima.apache.org by Andreas Niekler <an...@informatik.uni-leipzig.de> on 2012/08/16 21:28:31 UTC

Multiple views in Annotator Engine

Hello,

I wonder whether it is possible to define multiple sofas (views) in a UIMA
Collection Reader and pass their different contents to the sentence
annotator of the OpenNLP tools. Will there be a sentence annotation for
each sofa (view), or does OpenNLP UIMA automatically choose the first
sofa in the data?

How could I implement a CAS in which I can annotate title, document, and
subtitle (for example) separately in one chain?

Thank you

Andreas

Re: SolrCas does not store all features

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.

Hi Tommaso,

I already did Andreas fixed the problem.

All the best

Andreas

Re: SolrCas does not store all features

Posted by Tommaso Teofili <to...@gmail.com>.

Hi Andreas,

I'm taking a look at your use case, and I think it makes sense to support
DocumentAnnotation types too; however, I haven't had a chance to test it yet.
Feel free to open a Jira issue and, if you like, submit a patch for
that.

Thanks and regards,
Tommaso


Re: SolrCas does not store all features

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.

Hello,

I had a look at the source code of the annotator, and it actually uses a
HashMap to store the values by type name. Would it be better to use a
list for the type system, so that DocumentAnnotation types are supported
as well? Can we change that in the source?
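
A minimal sketch of the suspected behaviour, in plain Java and independent of the actual SolrCas source: if a reader stores one feature map per type name and simply calls put() for every <type> element, repeated elements with the same name replace each other. The class and method names (TypeMappingDemo, replacePerType, mergePerType) are made up for illustration and are not part of SolrCas.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TypeMappingDemo {

    /** One parsed <type name="..."><map feature="..." field="..."/></type> element (hypothetical model). */
    static final class Entry {
        final String typeName;
        final String feature;
        final String field;

        Entry(String typeName, String feature, String field) {
            this.typeName = typeName;
            this.feature = feature;
            this.field = field;
        }
    }

    /** Stores a fresh feature map per element: a later element replaces an earlier one with the same type name. */
    static Map<String, Map<String, String>> replacePerType(List<Entry> entries) {
        Map<String, Map<String, String>> byType = new HashMap<>();
        for (Entry e : entries) {
            Map<String, String> features = new LinkedHashMap<>();
            features.put(e.feature, e.field);
            byType.put(e.typeName, features); // overwrites any map stored earlier for this type
        }
        return byType;
    }

    /** Merges repeated elements for the same type name into one feature map. */
    static Map<String, Map<String, String>> mergePerType(List<Entry> entries) {
        Map<String, Map<String, String>> byType = new HashMap<>();
        for (Entry e : entries) {
            byType.computeIfAbsent(e.typeName, k -> new LinkedHashMap<>())
                  .put(e.feature, e.field);
        }
        return byType;
    }
}
```

If the consumer does behave like replacePerType, listing all <map> elements inside a single <type> element might avoid the problem without touching the source; that is an assumption about the mapping format, not something confirmed in this thread.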

Thank you



-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.deg.de

Re: SolrCas does not store all features

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.

UPDATE:

It seems that if I do something like this:

<type name="de.uni_leipzig.informatik.asv.uima.cpe.DocumentMetadata">
	<map feature="Date" field="Date" />
</type>
<type name="de.uni_leipzig.informatik.asv.uima.cpe.DocumentMetadata">
	<map feature="Id" field="_id" />
</type>
<type name="de.uni_leipzig.informatik.asv.uima.cpe.DocumentMetadata">
	<map feature="Company" field="Company" />
</type>
<type name="de.uni_leipzig.informatik.asv.uima.cpe.DocumentMetadata">
	<map feature="PublicationType" field="PublicationType" />
</type>
<type name="de.uni_leipzig.informatik.asv.uima.cpe.DocumentMetadata">
	<map feature="Language" field="Language" />
</type>
<type name="de.uni_leipzig.informatik.asv.uima.cpe.DocumentMetadata">
	<map feature="Creator" field="Creator" />
</type>
<type name="de.uni_leipzig.informatik.asv.uima.cpe.DocumentMetadata">
	<map feature="Page" field="Page" />
</type>
<type name="de.uni_leipzig.informatik.asv.uima.cpe.DocumentMetadata">
	<map feature="Section" field="Section" />
</type>
<type name="de.uni_leipzig.informatik.asv.uima.cpe.DocumentMetadata">
	<map feature="Subsection" field="Subsection" />
</type>
<type name="de.uni_leipzig.informatik.asv.uima.cpe.DocumentMetadata">
	<map feature="NewsAgency" field="NewsAgency" />
</type>

the values are overwritten somehow. Could that be?

Thank you

Andreas

SolrCas does not store all features

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.

Hello,

I'm currently working with the SolrCas Consumer. I set up a mapping file
and a schema for the Solr instance. The problem I'm facing is that some
values are written to the index fields and some are not. In the attached
mapping file I have the
de.uni_leipzig.informatik.asv.uima.cpe.DocumentMetadata annotation, and I
take several features from that annotation. The only feature that is
correctly written to the index is the Id feature, which I defined as the
<uniqueKey>_id</uniqueKey>. All other features of type
de.uni_leipzig.informatik.asv.uima.cpe.DocumentMetadata are not written
to the index by the SolrCasConsumer.

Can anybody tell me why this could happen? I have attached my schema.xml as well.

Thank you

Andreas

Re: Multiple views in Annotator Engine

Posted by Florian Leitner <fl...@cnio.es>.

As you have figured out already, the "container" is simply a subclass of an Annotation type (i.e., with offsets). If you check the OpenNLP code, you will see that all you need to do is tell the SentenceDetector the (fully qualified) name of your type as a String value, say "org.myorg.tcas.MyAnnotationType".

This type name string can be set as a parameter in the descriptor of the SentenceDetector annotator; the parameter is/should be named:

opennlp.uima.ContainerType

If set, the SentenceDetector will only split text into sentences that are inside the spans annotated by this "ContainerType".
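
To illustrate what "only inside the spans" means in terms of offsets, here is a rough plain-Java sketch. It does not use OpenNLP or UIMA: java.text.BreakIterator stands in for the sentence model, and the names ContainerSplitDemo, Span, and sentencesInside are invented for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ContainerSplitDemo {

    /** Half-open character span [begin, end) into the document text. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /**
     * Splits only the text covered by each container span into sentences.
     * Returned offsets are relative to the whole document, not to the container.
     */
    static List<Span> sentencesInside(String text, List<Span> containers) {
        List<Span> sentences = new ArrayList<>();
        for (Span c : containers) {
            String covered = text.substring(c.begin, c.end);
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
            it.setText(covered);
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                // Drop trailing whitespace so the span covers only the sentence itself.
                int trimmedEnd = end;
                while (trimmedEnd > start && Character.isWhitespace(covered.charAt(trimmedEnd - 1))) {
                    trimmedEnd--;
                }
                if (trimmedEnd > start) {
                    sentences.add(new Span(c.begin + start, c.begin + trimmedEnd));
                }
            }
        }
        return sentences;
    }
}
```

Text outside the container spans is never looked at, which is the effect the ContainerType parameter is described as having.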

Hope these instructions were clear enough - if you need more details, let me know. Here is the link to the XML descriptor for the OpenNLP SentenceDetector annotator once more:

https://svn.apache.org/repos/asf/opennlp/trunk/opennlp-uima/descriptors/SentenceDetector.xml

Cheers,
Florian

-- 
Florian Leitner, PhD <fl...@gmail.com>

Structural Biology and BioComputing Programme
Spanish National Cancer Research Centre (CNIO)

Address: C/ Melchor Fernandez Almagro 3; E-28029 Madrid
Phone: +34 91 732 8000
Fax: +34 91 224 6980
Internet: http://www.cnio.es

Re: Multiple views in Annotator Engine

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.

Hello,

Sorry for the additional reply, but I forgot to ask how I can then pass
this type to the annotator.

Thanks again

Andreas

Re: Multiple views in Annotator Engine

Posted by Andreas Niekler <an...@informatik.uni-leipzig.de>.

Hello,

Thanks a lot. But how exactly can I define a container type, which I
guess should be an AnnotationFS? And how do I then pass the container
information to the annotator? Because documentation for the OpenNLP
UIMA wrapper is missing, I get your point but don't know how to implement
a Collection Reader that can create such containers.


On 17.08.2012 12:41, Florian Leitner wrote:
> As far as the OpenNLP philosophy goes, you'd use a container type that would determine which part of the SOFA is a title, subtitle, document, or any other content you are interested in sentence-segmenting and only process text within that particular container type, while the default is to process the entire content if no container type is set;


Thanks a lot

Re: Multiple views in Annotator Engine

Posted by Florian Leitner <fl...@cnio.es>.

Hi Andreas,

Your question should probably be directed to the OpenNLP guys, I think. However, I am using OpenNLP with UIMA and can tell you that the OpenNLP SentenceDetector only works in "single view mode".

See the AE code at:

https://svn.apache.org/repos/asf/opennlp/trunk/opennlp-uima/src/main/java/opennlp/uima/sentdetect/AbstractSentenceDetector.java
https://svn.apache.org/repos/asf/opennlp/trunk/opennlp-uima/src/main/java/opennlp/uima/sentdetect/SentenceDetector.java

and the corresponding descriptor file:

https://svn.apache.org/repos/asf/opennlp/trunk/opennlp-uima/descriptors/SentenceDetector.xml

As far as the OpenNLP philosophy goes, you'd use a container type that would determine which part of the SOFA is a title, subtitle, document, or any other content you are interested in sentence-segmenting and only process text within that particular container type, while the default is to process the entire content if no container type is set; see AbstractSentenceDetector#process(CAS).

If you really need to process multiple views, you could use multiple aggregate SOFA/view mappings; see

http://uima.apache.org/downloads/releaseDocs/2.1.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html#ugr.tug.mvs.specifying_cas_view_for_single_view

Alternatively, you could write a wrapper around the OpenNLP annotator that works with multiple views; something like this should work, too:

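import opennlp.uima.sentdetect.SentenceDetector;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;

public class MySentenceAnnotator extends SentenceDetector {
	@Override
	public void process(CAS cas) throws AnalysisEngineProcessException {
		// Run the inherited single-view logic once per named view.
		super.process(cas.getView("view-a"));
		super.process(cas.getView("view-b"));
		// and so forth...
	}
}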
	
(Footnote: Naturally, you'd probably do that by using an array init parameter for the views you wish to process and a loop instead of the hardcoded string constants; this is just to show the basic idea.)

On a side-note, to me it seems a bit "fishy" that you are trying to split your SOFA into views depending on whether the relevant bit is a title, subtitle, body or any other part of the SOFA. In this respect, I think the OpenNLP approach with a "container annotation type" feels more UIMA-like: Views should be *different* views of the *same* content (e.g., different languages, raw [byte] document content vs. plain text, etc.), and not the "same" view [types] of different content. 

Hope this helps a bit!

Cheers,
Florian

RE: Multiple views in Annotator Engine

Posted by Torsten Zesch <ze...@ukp.informatik.tu-darmstadt.de>.

Hi Andreas,

> I wonder if it is possible to define multiple sofas (views) in a UIMA Collection
> Reader and pass those different contents to the sentence annotator of the
> OpenNLP tools. Will there be a sentence annotation for each sofa (view) or
> does OpenNLP UIMA automatically choose the first sofa in the data?

Yes, this is possible. 
You need to use SofaMapping in order to tell the sentence annotator which view it should work on.
 
> How could I implement such a CAS case where I'm able to annotate title,
> document and subtitle (for example) separately in one chain?

a) Create a component that annotates title, document, and subtitle in the CAS (you might need to introduce new types for that). Then they are not "really" annotated separately, but you will be able to easily retrieve the other annotations for each part later.
b) Put everything in a view of its own.
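
For option a), one way to picture the bookkeeping is the following plain-Java sketch: the parts are concatenated into a single document text while the character offsets of each part are recorded, so that each part can later be marked with its own annotation. The class SectionOffsetsDemo and its members are hypothetical names for this illustration, and no UIMA classes are used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SectionOffsetsDemo {

    /** The single document text that all parts are appended to. */
    final StringBuilder text = new StringBuilder();

    /** Part label -> {begin, end} offsets (end exclusive) into the document text. */
    final Map<String, int[]> spans = new LinkedHashMap<>();

    /** Appends one part and remembers where it starts and ends in the combined text. */
    void append(String label, String content) {
        int begin = text.length();
        text.append(content);
        spans.put(label, new int[] {begin, text.length()});
        text.append('\n'); // separator, so adjacent parts never run together
    }
}
```

In a real reader or annotator, the combined text would presumably be handed to CAS.setDocumentText(String) and each recorded span turned into an annotation of a custom type via CAS.createAnnotation(Type, int, int); how those types are defined is left open here.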

HTH,
Torsten