Posted to user@uima.apache.org by abhishek <ab...@sqlstar.com> on 2011/09/29 08:28:02 UTC

UIMA- Support for HTML, PDF, Doc files

Hi,

While reading the UIMA documentation, I found that UIMA supports HTML files.

However, when I run the org.apache.uima.tools.docanalyzer.DocumentAnalyzer class, it fails to understand the text.

Please let me know the correct way to read these types of files.

Re: UIMA- Support for HTML, PDF, Doc files

Posted by Julien Nioche <li...@gmail.com>.

Hi,

Have a look at the TikaAnnotator in the sandbox. It extracts the text and
metadata from various document formats and converts any available markup
into annotations.

HTH

Julien


On 29 September 2011 07:28, abhishek <ab...@sqlstar.com> wrote:

> Hi,
> While reading the UIMA documentation, I found that UIMA supports HTML
> files.
>
> However, when I run the
> org.apache.uima.tools.docanalyzer.DocumentAnalyzer class, it fails to
> understand the text.
>
> Please let me know the correct way to read these types of files.




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: UIMA- Support for HTML, PDF, Doc files

Posted by Jörn Kottmann <ko...@gmail.com>.

Hello,

UIMA itself is just a framework for building analysis pipelines. To analyze
HTML, PDF, or Word documents, you need a component that can extract the text
from these formats.

You can use Apache Tika, together with our Tika integration in the addons
project, to extract text from various data formats.
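
As a rough illustration of what "extracting the text" means for HTML, here
is a minimal sketch that uses only the JDK. It is not Tika and not the
TikaAnnotator: the class name NaiveHtmlText and the method extractText are
made up for this example, and regex-based tag stripping is a deliberately
naive stand-in that will break on real-world HTML. It only shows the kind of
plain string an extractor should produce before the text is handed to an
analysis engine.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of UIMA or Tika.
class NaiveHtmlText {
    // <script> and <style> elements carry no document text, so drop them whole.
    private static final Pattern NON_TEXT_BLOCKS =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag, e.g. <p>, </h1>, <a href="...">.
    private static final Pattern TAGS = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String extractText(String html) {
        String text = NON_TEXT_BLOCKS.matcher(html).replaceAll(" ");
        text = TAGS.matcher(text).replaceAll(" ");
        // Decode a handful of common entities; "&amp;" goes last so that
        // "&amp;lt;" becomes the literal text "&lt;" rather than "<".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>"
                + "<body><h1>UIMA</h1><p>supports&nbsp;plain text</p></body></html>";
        // Prints: UIMA supports plain text
        System.out.println(extractText(html));
    }
}
```

A string like the one printed here is what an analysis engine expects as its
document text. For PDF or Word input there is no such shortcut, which is why
a real extractor such as Tika is the practical choice.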

Jörn

On 9/29/11 8:28 AM, abhishek wrote:
> Hi,
> While reading the UIMA documentation, I found that UIMA supports HTML files.
>
> However, when I run the org.apache.uima.tools.docanalyzer.DocumentAnalyzer class, it fails to understand the text.
>
> Please let me know the correct way to read these types of files.