Posted to user@uima.apache.org by Marshall Schor <ms...@schor.com> on 2009/05/22 04:01:39 UTC

Re: document structure

Hi Julien,

Can you write up a little something and submit a patch to the website?

-Marshall

Julien Nioche wrote:
> Hi,
>
> I contributed an annotator to the sandbox some time ago which uses Tika to
> convert original markup into UIMA annotations. It does not seem to be listed
> on the website but it should be in the SVN repository of the sandbox.
>
> Tika supports numerous formats, such as PDF, XML and HTML.
>
> Julien
>
>   

Re: document structure

Posted by Marshall Schor <ms...@schor.com>.
I updated the UIMA website's sandbox page with this information.

-Marshall

Julien Nioche wrote:
> Hi Marshall,
>
> There is a description in the README.txt file from the TikaAnnotator
> repository, which I have slightly rewritten into the text below.
>
>
> *Apache Tika is a toolkit for detecting and extracting metadata and
> structured text content from various documents using existing parser
> libraries. The TikaAnnotator uses Tika to generate annotations representing
> the original markup of a document and to extract its text and metadata. It
> consists of three resources:
>
> - FileSystemCollectionReader: similar to the one in the UIMA examples, but
> uses Tika to extract the text from binary documents and generates
> annotations to represent the markup
>
> - MarkupAnnotator: takes the original content from a view and generates a
> new view containing the extracted text with markup annotations
>
> - TikaWrapper: utility class for populating a CAS from a binary
> document; used by the FileSystemCollectionReader*
>
>
> Best,
>
> J.
>
>   

Re: document structure

Posted by Julien Nioche <li...@gmail.com>.
Hi Marshall,

There is a description in the README.txt file from the TikaAnnotator
repository, which I have slightly rewritten into the text below.


*Apache Tika is a toolkit for detecting and extracting metadata and
structured text content from various documents using existing parser
libraries. The TikaAnnotator uses Tika to generate annotations representing
the original markup of a document and to extract its text and metadata. It
consists of three resources:

- FileSystemCollectionReader: similar to the one in the UIMA examples, but
uses Tika to extract the text from binary documents and generates
annotations to represent the markup

- MarkupAnnotator: takes the original content from a view and generates a
new view containing the extracted text with markup annotations

- TikaWrapper: utility class for populating a CAS from a binary
document; used by the FileSystemCollectionReader*
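
To make the idea of "extracted text with markup annotations" concrete, here is a
minimal sketch in plain Java. It is not the TikaAnnotator code and uses neither
Tika nor UIMA: a simple regular expression stands in for Tika's parsers, and the
names MarkupSketch, Span, Result and extract are hypothetical, chosen only for
this illustration. It assumes well-nested, paired tags and shows how each element
can become a stand-off annotation whose begin/end offsets point into the
extracted text, in the way a UIMA annotation's begin and end features do.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the TikaAnnotator.
class MarkupSketch {

    /** Stand-off annotation: element name plus offsets into the extracted text. */
    public static final class Span {
        public final String name;
        public final int begin;
        public final int end;

        Span(String name, int begin, int end) {
            this.name = name;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return name + "[" + begin + "," + end + ")";
        }
    }

    /** Extracted text together with the annotations that describe the markup. */
    public static final class Result {
        public final String text;
        public final List<Span> spans;

        Result(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    // Matches opening and closing tags; group 1 is "/" for a closing tag.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    public static Result extract(String markup) {
        StringBuilder text = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        Deque<String> openNames = new ArrayDeque<>();
        Deque<Integer> openOffsets = new ArrayDeque<>();

        Matcher m = TAG.matcher(markup);
        int last = 0;
        while (m.find()) {
            // Copy the character data that precedes this tag.
            text.append(markup, last, m.start());
            last = m.end();

            if (m.group().endsWith("/>")) {
                continue; // self-closing tag: no text span to annotate
            }
            String name = m.group(2);
            if (m.group(1).isEmpty()) {
                // Opening tag: remember where its content starts.
                openNames.push(name);
                openOffsets.push(text.length());
            } else if (!openNames.isEmpty() && openNames.peek().equals(name)) {
                // Matching closing tag: emit the annotation.
                openNames.pop();
                spans.add(new Span(name, openOffsets.pop(), text.length()));
            }
        }
        text.append(markup, last, markup.length());
        return new Result(text.toString(), spans);
    }

    public static void main(String[] args) {
        Result r = extract("<p>Hello <b>world</b></p>");
        System.out.println(r.text);
        for (Span s : r.spans) {
            System.out.println(s + " covers \"" + r.text.substring(s.begin, s.end) + "\"");
        }
    }
}
```

In the real component, the original content and the extracted text would live in
separate CAS views, as described for the MarkupAnnotator above; this sketch simply
returns both pieces in one object.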


Best,

J.

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

