Posted to user@uima.apache.org by abhishek <ab...@sqlstar.com> on 2011/09/29 08:28:02 UTC
UIMA- Support for HTML, PDF, Doc files
Hi,
While reading the UIMA documentation, I found that UIMA supports HTML files.
However, when I run the org.apache.uima.tools.docanalyzer.DocumentAnalyzer class, it fails to understand the text.
Kindly let me know the correct way to read these types of files.
Re: UIMA- Support for HTML, PDF, Doc files
Posted by Julien Nioche <li...@gmail.com>.
Hi,
Have a look at the TikaAnnotator in the sandbox. It extracts the text and
metadata from various document formats and converts any available markup
into annotations.
HTH
Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Re: UIMA- Support for HTML, PDF, Doc files
Posted by Jörn Kottmann <ko...@gmail.com>.
Hello,
UIMA itself is just a framework for building analysis pipelines. To
analyze HTML, PDF or Word documents, you need a component that can
extract the text from these formats.
You can use Apache Tika, together with our Tika integration in the
addons project, to extract text from various data formats.
Jörn
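To make the point of the replies concrete: an analysis pipeline only ever sees plain text, so something has to strip the markup before analysis starts. What follows is a minimal sketch of that idea using only the Java standard library. It is not Tika and not the TikaAnnotator, and the class MarkupExtractor with its Span and Result types are hypothetical names made up for this illustration. It removes the tags from a simple HTML string and records which stretch of the plain text each element covered, roughly the kind of information an extractor could pass on as annotations. It does not decode entities, and it does not cope with comments, scripts or badly broken markup.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: this is neither Tika nor any UIMA API.
class MarkupExtractor {

    // A stretch of the extracted plain text that one markup element covered.
    static final class Span {
        final String tag;
        final int begin; // inclusive offset into the plain text
        final int end;   // exclusive offset into the plain text

        Span(String tag, int begin, int end) {
            this.tag = tag;
            this.begin = begin;
            this.end = end;
        }
    }

    // Plain text plus the element spans, in the order the elements were closed.
    static final class Result {
        final String text;
        final List<Span> spans;

        Result(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    static Result extract(String html) {
        StringBuilder text = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        Deque<String> openTags = new ArrayDeque<>();     // names of unclosed elements
        Deque<Integer> openOffsets = new ArrayDeque<>(); // text offset where each began

        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int close = (c == '<') ? html.indexOf('>', i) : -1;
            if (close < 0) {
                // Ordinary character, or a stray '<' with no closing '>'.
                text.append(c);
                i++;
                continue;
            }
            String inside = html.substring(i + 1, close).trim();
            i = close + 1;
            if (inside.startsWith("/")) {
                String name = tagName(inside.substring(1));
                if (openTags.contains(name)) {
                    // Pop back to the matching open tag; elements left
                    // unclosed in between (e.g. <br>) are discarded.
                    while (true) {
                        String top = openTags.pop();
                        int begin = openOffsets.pop();
                        if (top.equals(name)) {
                            spans.add(new Span(name, begin, text.length()));
                            break;
                        }
                    }
                }
            } else if (!inside.isEmpty() && !inside.endsWith("/")
                    && Character.isLetter(inside.charAt(0))) {
                // Opening tag: remember where its content starts in the text.
                openTags.push(tagName(inside));
                openOffsets.push(text.length());
            }
            // Anything else (doctype, self-closing tags) is simply dropped.
        }
        return new Result(text.toString(), spans);
    }

    // Lower-cased element name, without attributes.
    private static String tagName(String s) {
        s = s.trim();
        int ws = 0;
        while (ws < s.length() && !Character.isWhitespace(s.charAt(ws))) {
            ws++;
        }
        return s.substring(0, ws).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Result r = extract("<h1>Title</h1><p>Hello <b>UIMA</b></p>");
        System.out.println(r.text);
        for (Span s : r.spans) {
            System.out.println(s.tag + " [" + s.begin + "," + s.end + ") "
                    + r.text.substring(s.begin, s.end));
        }
    }
}
```

In a real pipeline the extracted text would become the document text that the analysis engine processes, and the spans would become annotations over it. For PDF, Word and real-world HTML, use Tika as suggested in the replies rather than anything hand-rolled like this.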