Posted to dev@uima.apache.org by Julien Nioche <li...@gmail.com> on 2008/09/12 18:13:02 UTC

Donate TIkaAnnotator to Sandbox

Dear UIMA devs,

We have recently developed an AnnotationReader for UIMA which uses Tika to
convert the markup into annotations. The resource consists of a
CollectionReader, a CasAnnotator and a utility class which can populate a
cas with markup annotations. It is certainly not perfect but it does a
decent job. The type system is inspired see
http://cwiki.apache.org/UIMA/uima-sandbox-components.html
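
As a rough illustration of the idea (not the donated code itself), the sketch below turns XHTML-style SAX events into plain text plus element spans with character offsets, which is the kind of information a markup annotation would carry. It assumes the JDK's own SAX parser as a stand-in for Tika's XHTML output; the class name MarkupSpanCollector and its Span type are hypothetical and do not come from the contribution.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: collects extracted text and element spans from SAX events. */
final class MarkupSpanCollector extends DefaultHandler {

    /** One markup element and its [begin, end) offsets in the extracted text. */
    static final class Span {
        final String name;
        final int begin;
        final int end;

        Span(String name, int begin, int end) {
            this.name = name;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return name + "[" + begin + "," + end + ")";
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final Deque<Integer> openOffsets = new ArrayDeque<>();
    private final List<Span> spans = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Remember where this element starts in the text extracted so far.
        openOffsets.push(text.length());
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Close the most recently opened element; spans are recorded in closing order.
        int begin = openOffsets.pop();
        spans.add(new Span(qName, begin, text.length()));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    List<Span> spans() {
        return spans;
    }

    /** Parses a well-formed XHTML string; a real setup would receive these events from Tika. */
    static MarkupSpanCollector collect(String xhtml) {
        MarkupSpanCollector handler = new MarkupSpanCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        MarkupSpanCollector c = collect("<html><body><p>Hello</p><p>UIMA</p></body></html>");
        System.out.println(c.text());  // HelloUIMA
        System.out.println(c.spans()); // [p[0,5), p[5,9), body[0,9), html[0,9)]
    }
}
```

In a UIMA component, each span would presumably become one annotation over the document text, with the element name stored in a feature.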

I would be more than happy to donate the code to the Sandbox. What is the
procedure for that?

Have a good weekend

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Re: Donate TIkaAnnotator to Sandbox

Posted by Marshall Schor <ms...@schor.com>.
I've been swamped with other things - but I mentioned this in the
quarterly board report :-).  I'll try to take a look soon...

-Marshall

Julien Nioche wrote:
> Hi guys,
>
> Did anyone give https://issues.apache.org/jira/browse/UIMA-1095 a try? Any
> thoughts on it?
>
> Best,
>
> J.
>
>   

Re: Donate TIkaAnnotator to Sandbox

Posted by Thilo Goetz <tw...@gmx.de>.
Jörn Kottmann wrote:
> 
> On Oct 10, 2008, at 3:28 PM, Julien Nioche wrote:
> 
>> Hi Jorn,
>>
>> The MarkupAnnotator could be useful for many UIMA users, including me.
>> Maybe
>>> it would be nice if its possible to configure the types used for markup.
>>
>>
>> that's a general issue with UIMA and most AEs rely on a preexisting
>> typeset.
>> The solution would be to have a generic resource for mapping between
>> files
>> (e.g. feature named X in type T is the same as feature Y in type U) or
>> rule
>> based system allowing you to do manipulate annotations (like JAPE in
>> GATE).
> 
> No, I was talking about a type mapping which could be configured in the
> descriptor
> for example.
> 
> Is there anything which must be discussed before Julien can
> start the vote ?

I haven't had time to look at the code myself, but you and Thomas have,
and I have seen other contributions of Julien, so that's good enough
for me.  We've been talking about a Tika annotator for a while, so
let's go for it.  Please do start a formal vote.

Julien, since you are contributing more, it would be good if you could
send in an ICLA (individual contributor licensing agreement).  It's a
relatively painless form that you need to send to the ASF secretary
which helps us to keep our IP story clean.  See the UIMA web site
http://incubator.apache.org/uima/get-involved.html
and the ASF page here:
http://www.apache.org/licenses/#clas

--Thilo

> 
> Jörn
> 


Re: Donate TIkaAnnotator to Sandbox

Posted by Jörn Kottmann <ko...@gmail.com>.
On Oct 10, 2008, at 3:28 PM, Julien Nioche wrote:

> Hi Jorn,
>
> The MarkupAnnotator could be useful for many UIMA users, including  
> me. Maybe
>> it would be nice if its possible to configure the types used for  
>> markup.
>
>
> that's a general issue with UIMA and most AEs rely on a preexisting  
> typeset.
> The solution would be to have a generic resource for mapping between  
> files
> (e.g. feature named X in type T is the same as feature Y in type U)  
> or rule
> based system allowing you to do manipulate annotations (like JAPE in  
> GATE).

No, I was talking about a type mapping which could be configured in  
the descriptor
for example.
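
A minimal sketch of what such a configurable mapping could look like, assuming the descriptor supplies a multi-valued string parameter with entries of the form element=type (in UIMA such values would typically be read in the annotator through UimaContext.getConfigParameterValue). The class name MarkupTypeMapping, the entry format and the type names are hypothetical and only serve as illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: maps markup element names to configured type names. */
final class MarkupTypeMapping {

    /** Assumed fallback type used when no explicit mapping is configured. */
    static final String DEFAULT_TYPE = "example.MarkupAnnotation";

    private final Map<String, String> elementToType = new LinkedHashMap<>();

    /** Each entry is expected to look like "p=example.Paragraph". */
    MarkupTypeMapping(String[] entries) {
        for (String entry : entries) {
            int eq = entry.indexOf('=');
            String element = eq < 0 ? "" : entry.substring(0, eq).trim();
            String type = eq < 0 ? "" : entry.substring(eq + 1).trim();
            if (element.isEmpty() || type.isEmpty()) {
                throw new IllegalArgumentException("Expected element=type, got: " + entry);
            }
            // Element names are matched case-insensitively.
            elementToType.put(element.toLowerCase(Locale.ROOT), type);
        }
    }

    /** Returns the configured type for an element, or the generic fallback type. */
    String typeFor(String elementName) {
        String type = elementToType.get(elementName.toLowerCase(Locale.ROOT));
        return type != null ? type : DEFAULT_TYPE;
    }

    public static void main(String[] args) {
        MarkupTypeMapping mapping = new MarkupTypeMapping(
                new String[] {"p=example.Paragraph", "title=example.Title"});
        System.out.println(mapping.typeFor("p"));   // example.Paragraph
        System.out.println(mapping.typeFor("div")); // example.MarkupAnnotation
    }
}
```

Falling back to a generic type keeps the annotator usable when a parser emits element names that the configuration does not list.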

Is there anything which must be discussed before Julien can
start the vote ?

Jörn



Re: Donate TIkaAnnotator to Sandbox

Posted by Julien Nioche <li...@gmail.com>.
Hi Jorn,

> The MarkupAnnotator could be useful for many UIMA users, including me. Maybe
> it would be nice if its possible to configure the types used for markup.


That's a general issue with UIMA: most AEs rely on a preexisting type system.
The solution would be to have a generic resource for mapping between files
(e.g. feature named X in type T is the same as feature Y in type U) or a
rule-based system allowing you to manipulate annotations (like JAPE in GATE).
Or did you mean having a way to specify which markups will be converted into
annotations?

Thank you for your feedback
Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Re: Donate TIkaAnnotator to Sandbox

Posted by Jörn Kottmann <ko...@gmail.com>.
> Did anyone give https://issues.apache.org/jira/browse/UIMA-1095 a  
> try? Any
> thoughts on it?

I looked through the code.

The MarkupAnnotator could be useful for many UIMA users,
including me. Maybe it would be nice if it's possible to configure
the types used for markup.

Thanks,
Jörn


Re: Donate TIkaAnnotator to Sandbox

Posted by Julien Nioche <li...@gmail.com>.
Hi Thomas

> I gave it a first try. I have just used it and did not seriously look at the
> code yet. Here is some initial, unsorted user feedback:
> - Having a binary TIKA jar would speed things up (needed help to get that
> built)
> - It worked fine for me once I got the jar

The Tika jar will be available in the sandbox (if the TikaAnnotator ever
gets there). I can't put binaries in a diff file.


>
> - In my initial trial setup I added both the Tika CollectionReader and the
>  TIKA MarkupAnnotator to a CPE flow assuming that's what's needed. Only
> after
>  overcoming some confusion about the resulting CASes I realized that they
> are
>  intended to be used either/or. A word in the README may spare other people
>  the confusion.

I have added a line on this to the README.


>
> - MarkupAnnotator.xml states <outputsNewCASes>true</outputsNewCASes>. CVD
> will
>  not show any results for annotators with that setting. And in fact the
>  annotator runs just fine with that setting changed to false. From what I
>  could see in the code it just creates a new view not a new CAS. But maybe
> I
>  am missing something here.

I changed it to false - can't remember why it was set to true in the first
place


>
> - It returned reasonable results on a few HTML, MS-Word and PPT files I
> tried.
>  I silently refused to covert one PDF file (others worked). But I guess
> this
>  are just limitations of the current PDF parser.

do you get the extracted text at least?


>
> - As I understand it TIKA maps all document markup to the XHTML tagset.
> Since
>  that is a closed set it should be possible to use a more explicit
> typesystem
>  modeling, where the known XHTML elements like title, body, p etc. are
>  modeled as explicit subtypes instead of having only one generic type
>  MarkupAnnotation. Is that assumption correct?

Indeed, but I think there is no strong constraint on the names of the
elements, and most of it relies on convention (at each parser's level). This
means that the values can change. People could develop alternative parsers, or
parsers for new formats, which would not follow the XHTML conventions. I
would rather not make too many assumptions as to what is returned by Tika
and return generic annotations.


>
>  Which typesystem representation to use depends on use case (and taste :-)
>  but finding and iterating over the different parts of the markup would be
>  easier with explicit types.

I suppose one could easily write a custom resource for converting the
annotation types returned by the TikaAnnotator into more explicit types if
necessary.
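
A minimal sketch of such a conversion step, assuming the generic annotation exposes the element name and its offsets. All class names here (ExplicitMarkupTypes, Markup, Title, Paragraph) are hypothetical stand-ins, not types from the TikaAnnotator type system. Elements without a known explicit type are kept as generic annotations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical sketch: converts generic markup annotations into explicit subtypes. */
final class ExplicitMarkupTypes {

    /** Stand-in for a generic markup annotation: element name plus text offsets. */
    static class Markup {
        final String name;
        final int begin;
        final int end;

        Markup(String name, int begin, int end) {
            this.name = name;
            this.begin = begin;
            this.end = end;
        }
    }

    static final class Title extends Markup {
        Title(int begin, int end) {
            super("title", begin, end);
        }
    }

    static final class Paragraph extends Markup {
        Paragraph(int begin, int end) {
            super("p", begin, end);
        }
    }

    /** Known element names and how to build their explicit type. */
    private static final Map<String, BiFunction<Integer, Integer, Markup>> FACTORIES =
            new HashMap<>();

    static {
        FACTORIES.put("title", (b, e) -> new Title(b, e));
        FACTORIES.put("p", (b, e) -> new Paragraph(b, e));
    }

    /** Replaces known elements with explicit subtypes; unknown ones stay generic. */
    static List<Markup> convert(List<Markup> generic) {
        List<Markup> out = new ArrayList<>(generic.size());
        for (Markup m : generic) {
            BiFunction<Integer, Integer, Markup> factory = FACTORIES.get(m.name);
            out.add(factory != null ? factory.apply(m.begin, m.end) : m);
        }
        return out;
    }

    /** Iterating over one kind of markup becomes a simple type filter. */
    static <T extends Markup> List<T> select(List<Markup> all, Class<T> type) {
        List<T> out = new ArrayList<>();
        for (Markup m : all) {
            if (type.isInstance(m)) {
                out.add(type.cast(m));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Markup> generic = new ArrayList<>();
        generic.add(new Markup("title", 0, 5));
        generic.add(new Markup("p", 6, 20));
        generic.add(new Markup("blink", 21, 30));
        List<Markup> explicit = convert(generic);
        System.out.println(select(explicit, Paragraph.class).size()); // 1
    }
}
```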

Thank you for your feedback
Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Re: Donate TIkaAnnotator to Sandbox

Posted by Thomas Hampp <th...@de.ibm.com>.
Julien Nioche <li...@...> writes:

> 
> Hi guys,
> 
> Did anyone give https://issues.apache.org/jira/browse/UIMA-1095 a try? Any
> thoughts on it?
> 
> Best,
> 
> J.
> 
Hi Julien,

Thanks for that contribution. I think that kind of functionality is important
for UIMA.

I gave it a first try. I have just used it and did not seriously look at the
code yet. Here is some initial, unsorted user feedback:
- Having a binary TIKA jar would speed things up (needed help to get that built)
- It worked fine for me once I got the jar
- In my initial trial setup I added both the Tika CollectionReader and the 
  TIKA MarkupAnnotator to a CPE flow assuming that's what's needed. Only after 
  overcoming some confusion about the resulting CASes I realized that they are 
  intended to be used either/or. A word in the README may spare other people 
  the confusion.
- MarkupAnnotator.xml states <outputsNewCASes>true</outputsNewCASes>. CVD will 
  not show any results for annotators with that setting. And in fact the 
  annotator runs just fine with that setting changed to false. From what I 
  could see in the code it just creates a new view not a new CAS. But maybe I 
  am missing something here.
- It returned reasonable results on a few HTML, MS-Word and PPT files I tried. 
  It silently refused to convert one PDF file (others worked). But I guess these 
  are just limitations of the current PDF parser.
- The typesystem does have the necessary information needed for further 
  processing. 
- As I understand it TIKA maps all document markup to the XHTML tagset. Since 
  that is a closed set it should be possible to use a more explicit typesystem 
  modeling, where the known XHTML elements like title, body, p etc. are 
  modeled as explicit subtypes instead of having only one generic type 
  MarkupAnnotation. Is that assumption correct?
  Which typesystem representation to use depends on use case (and taste :-) 
  but finding and iterating over the different parts of the markup would be 
  easier with explicit types.
- I think for document level meta data attributes the situation is 
  different since it's open (but there may be core set as well).

So much for the first impressions. Good work.

- Thomas





Re: Donate TIkaAnnotator to Sandbox

Posted by Julien Nioche <li...@gmail.com>.
Hi guys,

Did anyone give https://issues.apache.org/jira/browse/UIMA-1095 a try? Any
thoughts on it?

Best,

J.

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2008/9/22 Julien Nioche <li...@gmail.com>

> The Tika jar was to large to be uploaded anyway so I've put a patch file on
> https://issues.apache.org/jira/browse/UIMA-1095
>
> Cheers
>
> Julien
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2008/9/22 Julien Nioche <li...@gmail.com>
>
>> I'll put a tar.gz instead if you don't mind - this resources relies on a
>> binary Tika jar in lib directory so patches won't help with that
>>
>>  So, Julien - I
>>> suggest you consider posting your potential donation as patches to this
>>> Jira, after you've made sure that the IP is ok.
>>>
>>>
>
>
>

Re: Donate TIkaAnnotator to Sandbox

Posted by Julien Nioche <li...@gmail.com>.
The Tika jar was too large to be uploaded anyway so I've put a patch file on
https://issues.apache.org/jira/browse/UIMA-1095

Cheers

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2008/9/22 Julien Nioche <li...@gmail.com>

> I'll put a tar.gz instead if you don't mind - this resources relies on a
> binary Tika jar in lib directory so patches won't help with that
>
>  So, Julien - I
>> suggest you consider posting your potential donation as patches to this
>> Jira, after you've made sure that the IP is ok.
>>
>>

Re: Donate TIkaAnnotator to Sandbox

Posted by Julien Nioche <li...@gmail.com>.
I'll put a tar.gz instead if you don't mind - this resource relies on a
binary Tika jar in the lib directory, so patches won't help with that

> So, Julien - I
> suggest you consider posting your potential donation as patches to this
> Jira, after you've made sure that the IP is ok.
>
>
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Re: Donate TIkaAnnotator to Sandbox

Posted by Marshall Schor <ms...@schor.com>.

Jörn Kottmann wrote:
>
> On Sep 12, 2008, at 6:13 PM, Julien Nioche wrote:
>
>> Dear UIMA devs,
>>
>> We have recently developed an AnnotationReader for UIMA which uses
>> Tika to
>> convert the markup into annotations. The resource consists of a
>> CollectionReader, a CasAnnotator and a utility class which can
>> populate a
>> cas with markup annotations. It is certainly not perfect but it does a
>> decent job. The type system is inspired see
>> http://cwiki.apache.org/UIMA/uima-sandbox-components.html
>
> There is already a jira issue to add a Tika Annotator to the sandbox
> https://issues.apache.org/jira/browse/UIMA-1095
>
Thanks Jörn.  This, of course, just means there's already interest in
doing this :-) - not that someone else has done it.  So, Julien - I
suggest you consider posting your potential donation as patches to this
Jira, after you've made sure that the IP is ok.

-Marshall


> Jörn
>

Re: Donate TIkaAnnotator to Sandbox

Posted by Jörn Kottmann <ko...@gmail.com>.
On Sep 12, 2008, at 6:13 PM, Julien Nioche wrote:

> Dear UIMA devs,
>
> We have recently developed an AnnotationReader for UIMA which uses  
> Tika to
> convert the markup into annotations. The resource consists of a
> CollectionReader, a CasAnnotator and a utility class which can  
> populate a
> cas with markup annotations. It is certainly not perfect but it does a
> decent job. The type system is inspired see
> http://cwiki.apache.org/UIMA/uima-sandbox-components.html

There is already a jira issue to add a Tika Annotator to the sandbox
https://issues.apache.org/jira/browse/UIMA-1095

Jörn

Re: Donate TIkaAnnotator to Sandbox

Posted by Marshall Schor <ms...@schor.com>.
Hi Julien,

The process for donating to the sandbox is as follows:

a) propose something you'd like to donate, on the uima-users or uima-dev
list, and see if the community is interested.  This is often a bit hard
to gauge - you can't always take non-response as a sign of
non-interest.  If you get no response, say, after 3 days, please re-post
(perhaps by replying to your own initial post) asking for some
response.  (What can I say - people are busy, go on vacations, etc... -
so it's always worth reposting if there is no response...)

b) if there is some discussion, participate, and see where it goes.  If
it seems that there is some consensus that the community wants the
donation, then create a Jira and attach the donation as a zip file or
whatever makes sense.

  -- if the donation is *large*, we may ask for an Apache Software Grant
( http://www.apache.org/licenses/software-grant.txt ).

c) The community may then have more discussion, mainly to resolve any IP
issues and ensure that the code has a good chance of community
involvement in on-going maintenance - although for *sandbox* projects,
the bar is somewhat lower here - to encourage submissions :-).

d) Following the discussion, the consensus will be formalized by a vote
to decide to bring the donation into the sandbox.  After the vote and
confirming that the Software Grant (if needed) is recorded, the code
will be put into the sandbox.

e) Following that will be an IP - clearance form fill-out and recording
(which is done by one of our mentors with our help).  For reference -
the IP - clearance form is here:
http://incubator.apache.org/ip-clearance/ip-clearance-template.html

-Marshall

P.S. - if I got any of these steps wrong, please correct.  I intend to
post the corrected version to the website, for future reference.

Julien Nioche wrote:
> Dear UIMA devs,
>
> We have recently developed an AnnotationReader for UIMA which uses Tika to
> convert the markup into annotations. The resource consists of a
> CollectionReader, a CasAnnotator and a utility class which can populate a
> cas with markup annotations. It is certainly not perfect but it does a
> decent job. The type system is inspired see
> http://cwiki.apache.org/UIMA/uima-sandbox-components.html
>
> I would be more than happy to donate the code to the Sandbox. What is the
> procedure for that?
>
> Have a good week end
>
> Julien
>