Posted to user@uima.apache.org by Ekaterina Buyko <bu...@coling-uni-jena.de> on 2007/03/07 15:28:01 UTC
Re: Document "properties" and SourceDocumentInformation

Dear Greg,

You certainly raise a good point in deploring the lack of a commonly
shared (standard) UIMA annotation scheme for NLP purposes. Such a
scheme would allow components developed at many different sites
worldwide to be plugged together flexibly.

We encountered the same problem in the context of BOOTStrep 
(www.bootstrep.org), a European STREP project in which we are heavily 
involved together with six international partners. We use UIMA as a 
common platform for developing NLP software for text mining in biology.
As part of our project activities, we have in the meantime developed a
multi-layered UIMA annotation type system. This type system currently
contains six spec layers: document meta information (author, title
etc.), document structure and style information, morpho-syntax, syntax
and semantics (discourse to come). In our work, we integrated existing
annotation schemes from the NLP community as far as possible (such as
TEI, Dublin Core, Penn Treebank etc.). The scheme is designed with
domain independence in mind, though some portions (e.g., document
structure and semantics, of course) introduce bits of domain
dependence. Coverage of general-language applications (e.g.,
newspapers) should not, however, pose a major problem.

Please see our paper at the upcoming UIMA workshop at GLDV 2007:
http://incubator.apache.org/uima/downloads/gldv/gldv07-uima-hahn.pdf

We are aware that other teams are working on the same challenges, and
we very much like the idea of coordinating these efforts in order to
work out a common UIMA annotation scheme for NLP. Accordingly, we find
the idea of building a consortium within the Apache UIMA project really
fascinating. It is certainly one way to speed up consensus on a
commonly shared standard UIMA annotation scheme and to create an
international UIMA community.

Best regards from Jena

Ekaterina Buyko & Udo Hahn

-- 

Ekaterina Buyko
Jena University Language and Information Engineering (JULIE) Lab
Phone: +49-3641-944322
Fax:   +49-3641-944321
email: buyko@coling-uni-jena.de
URL:   http://www.coling.uni-jena.de


Thilo Goetz wrote:
> greg@holmberg.name wrote:
>> What is the recommended way of storing document properties, such as 
>> "author", "date created", "title", etc?
>>
>> I also need some data for internal uses, such as the document size 
>> and URI.
>>
>> One other requirement: this is not a closed vertical solution with a 
>> known set of annotators designed to inter-operate.  This is an 
>> application platform that will use some known annotators but allow 
>> plugging in arbitrary unknown annotators from other companies (that's 
>> why one uses UIMA, of course!).  Also, some of our annotators may be 
>> used in UIMA containers from other companies with unknown 
>> annotators.  So my code can't depend on either the UIMA container 
>> providing, or all of the other annotators (but possibly our own) 
>> knowing about, any data structure containing these properties.
>>
>> I see a few possibilities:
>>
>> 1. Add features to DocumentAnnotation
>> 2. Add features to SourceDocumentInformation
>> 3. Create my own annotation or TOP FS.
>>
>> The documentation recommends not adding features to 
>> DocumentAnnotation if you are using JCas (I am).  I agree--what if 
>> both my annotators and someone else's annotator have added features 
>> to DA?  It just wouldn't work, right?
>>
>> It's the same with SDI, if two annotators both add features to it.  
>> They are in conflict, and they can't be merged.
>>
>> SDI is useful however, since it has the document size and URI.  
>> Despite it being in a package called "examples", in truth it's become 
>> a standard.  All the annotators that ship with UIMA use it.  If you 
>> want to use the semantic search (Juru) indexing CAS Consumer, you 
>> have to use SDI.   I'm sure many annotators in the world have used SDI.
>>
>> I would like my annotators and UIMA container to be compatible with 
>> all those annotators.  Therefore, I think I have to use SDI for size 
>> and URI, but not modify it.
>>
>> Creating my own annotation (or is extending TOP FS better?) seems 
>> like the best answer.  My UIMA container and set of annotators would 
>> know about it, and others' annotators wouldn't be affected.  My 
>> annotators would have to gracefully degrade when running in a UIMA 
>> container that doesn't provide this new annotation.
>>
>> What are people's thoughts?  1, 2 or 3?
>
> If you use the JCas, as you say you do, definitely 3.  There is no 
> need to use an annotation; extending TOP would be sufficient.
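
The "graceful degradation" requirement for option 3 could be sketched
roughly as follows. This is plain Java and deliberately does not use
the UIMA API: FakeCas, DocumentMetadata and TitleAwareAnnotator are
hypothetical stand-ins, invented here only to show an annotator that
uses a custom metadata structure when the container provides one and
falls back to a default when it does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a custom TOP-derived feature structure
// that carries document metadata. Not a real UIMA type.
final class DocumentMetadata {
    final String author;
    final String title;

    DocumentMetadata(String author, String title) {
        this.author = author;
        this.title = title;
    }
}

// Hypothetical stand-in for a CAS: just a list of objects.
// This is NOT the UIMA CAS API.
final class FakeCas {
    private final List<Object> featureStructures = new ArrayList<>();

    void add(Object fs) {
        featureStructures.add(fs);
    }

    // Returns the first stored object of the requested type, if any.
    <T> Optional<T> findFirst(Class<T> type) {
        return featureStructures.stream()
                .filter(type::isInstance)
                .map(type::cast)
                .findFirst();
    }
}

final class TitleAwareAnnotator {
    // Uses the metadata when present; otherwise degrades to a default
    // instead of failing.
    static String describe(FakeCas cas) {
        return cas.findFirst(DocumentMetadata.class)
                .map(m -> m.title + " by " + m.author)
                .orElse("(no document metadata available)");
    }
}
```

In a container that never adds a DocumentMetadata instance, describe()
simply returns the fallback text, so the annotator still runs.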
>
>>
>> ================
>>
>> Longer term, I think we as a community need to define Type Systems 
>> that allow inter-operability of annotators and CAS Consumers.  For 
>> example, we could create an official SourceDocumentInformation that 
>> allows arbitrary sets of document properties as simple name-value 
>> pairs.  In other words, add this feature to SourceDocumentInformation:
>>
>>         properties    uima.cas.FSArray    PropertyFS
>>
>>     uima.PropertyFS   uima.cas.TOP
>>         name          uima.cas.String
>>         value         uima.cas.String
>>         scheme        uima.cas.String
>
> I'm personally not a big fan of arbitrary attribute-value schemes like 
> this.  You need yet another place (outside the type system) where you 
> document what the properties are that you define and expect.
>
>>
>> And define that names, values, and schemes conform to the Dublin Core 
>> Metadata Initiative standards.
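
To make the name/value/scheme proposal concrete, here is a minimal
plain-Java sketch of such a property list. It mirrors the three
features of the proposed PropertyFS but does not use the UIMA API.
The class names DocProperty and DocProperties are hypothetical, and so
is the reading of "scheme" as the encoding scheme of the value (as in
the Dublin Core example "W3CDTF" for dates); the proposal itself does
not spell this out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Plain-Java mirror of the proposed PropertyFS (name, value, scheme).
// Hypothetical illustration only, not a UIMA type.
final class DocProperty {
    final String name;
    final String value;
    final String scheme; // assumed: encoding scheme of the value

    DocProperty(String name, String value, String scheme) {
        this.name = name;
        this.value = value;
        this.scheme = scheme;
    }
}

// Stand-in for the proposed "properties" array feature.
final class DocProperties {
    private final List<DocProperty> properties;

    DocProperties(List<DocProperty> properties) {
        // Defensive copy so later changes to the caller's list
        // do not affect this object.
        this.properties = new ArrayList<>(properties);
    }

    // Returns the first property with the given name, if any.
    Optional<DocProperty> lookup(String name) {
        return properties.stream()
                .filter(p -> p.name.equals(name))
                .findFirst();
    }
}
```

Note that a consumer has to know the string "DC.date" in advance;
nothing in the type definition itself lists the permitted names, which
is the documentation burden raised in the reply above.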
>>
>>
>> Similarly, I think we need to create Type System standards for 
>> representing document structure.  For example, how could HTML 
>> elements and attributes be stored in the CAS such that all annotators 
>> could depend on them being there and therefore make intelligent use 
>> of them?
>>
>>
>> And finally, we need some Type System standards for representing 
>> certain common result annotations, such as lexical markup and named 
>> entities.  How can we combine two annotators from different companies 
>> if they don't have a shared definition of the data flowing between them?
>>
>>
>> And isn't this the whole point of UIMA?  It appears to me that the 
>> UIMA dream won't come true until we create these standards for data 
>> exchange or data transformation within the CAS.
>>
>> In my opinion, the current situation really limits the usefulness of 
>> UIMA as a platform for text processing (unless you control every 
>> piece of code in the system, of course).
>>
>> How do we start such a consortium?
>
> This mailing list is a good start ;-).  I know there are others who 
> work on similar things, but I'll let them speak for themselves.
>
> One issue of course is that it is difficult to agree on any common 
> type system.  It's hard enough to even agree on what an annotation is, 
> let alone specific types of annotations.  We could try to define a 
> certain base set on Apache.  I would hesitate to put more built-in 
> types into UIMA itself, though.  I'd rather have a type system 
> repository where we modularly define certain kinds of type systems 
> (such as html markup, for example), and that people can use, or not.
>
> --Thilo
>
>>
>> Thanks for listening,
>>
>>
>> Greg Holmberg