Posted to user@uima.apache.org by Sam Fisher <sa...@gmail.com> on 2008/03/12 18:29:32 UTC

parsing html as a string from getDocumentText

Hi All,

Having played around with plain text files in UIMA, I'm now feeding an 
HTML file to the Document Analyzer.  The JCas holds the contents of this 
file, both markup and text, as a single string.  After reading through 
the MarkMail archives, I decided to try the Jericho HTML parser for 
extracting the plain text content from the HTML string (e.g. String 
theHtml = jcas.getDocumentText()). I'm probably not using Jericho 
correctly, because the output of the parser is the same as what went in 
(not stripped down to only the text content).

So that I bark up the right tree, I wonder if the CAS forces some kind 
of encoding, like UTF-8, that might cause the parser to be blind to the 
markup tags in the HTML string?  This seems ridiculous, but I thought 
I'd ask.

Has anyone had success using jericho with uima?


Many thanks,

Sam

Re: parsing html as a string from getDocumentText

Posted by Sam Fisher <sa...@gmail.com>.
Hi Roman,

You confirmed that I wasn't losing my mind, but that I had been negligent 
about the configuration of my AE: I was running the Whiteboard2 flow 
controller instead of fixed flow, so another, older and to-be-discarded 
annotator was writing into it.  Time to throw out the trash! (Works fine now.)

Good learning experience. Thanks very much for your help.

-Sam

Roman Klinger wrote:
> Dear Sam,
>
> Sam Fisher wrote:
>> I'm probably not using Jericho correctly, because the output of the 
>> parser is the same as what went in (not stripped down to only the 
>> text content).
>>
>>   
>
> I also think so ;-). I experimented with Jericho in UIMA and did not 
> have any problems.
>
>> Has anyone had success using jericho with uima?
>>
>>   
>
> How did you use Jericho?
>
> I did not have any problems with
>
> new Source(new StringReader("<html>Te<b>s</b>t</html>")).getTextExtractor();
>
> or in UIMA with
>
> new Source(new StringReader(jCas.getDocumentText())).getTextExtractor();
>
>
> Best regards,
> Roman
>
>

Re: parsing html as a string from getDocumentText

Posted by Roman Klinger <ro...@scai.fraunhofer.de>.
Dear Sam,

Sam Fisher wrote:
> I'm probably not using Jericho 
> correctly, because the output of the parser is the same as what went in 
> (not stripped down to only the text content).
>
>   

I also think so ;-). I experimented with Jericho in UIMA and did not 
have any problems.

> Has anyone had success using jericho with uima?
>
>   

How did you use Jericho?

I did not have any problems with

new Source(new StringReader("<html>Te<b>s</b>t</html>")).getTextExtractor();

or in UIMA with

new Source(new StringReader(jCas.getDocumentText())).getTextExtractor();
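
On the encoding question from the original post: getDocumentText() 
returns an ordinary Java String, so no byte encoding sits between the 
CAS and the parser at that point. Even if the text had passed through 
UTF-8 somewhere upstream, '<' and '>' are ASCII characters and come back 
unchanged. Below is a minimal sketch of that check using only the 
standard library. The class name, the method names, and the sample HTML 
are made up for illustration, and the sample string merely stands in for 
whatever jcas.getDocumentText() would return.

```java
import java.nio.charset.StandardCharsets;

public class MarkupSurvivesEncoding {

    // Encode the text to UTF-8 bytes and decode it back into a String.
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Count '<' characters as a crude signal that markup is still present.
    static long countOpenAngleBrackets(String text) {
        return text.chars().filter(c -> c == '<').count();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for jcas.getDocumentText(); includes one
        // non-ASCII character to show that it does not disturb the tags.
        String html = "<html><body>Caf\u00e9 <b>test</b></body></html>";

        String decoded = roundTripUtf8(html);

        // True when the round trip left the string unchanged.
        System.out.println("unchanged: " + decoded.equals(html));
        // Number of '<' characters still visible after the round trip.
        System.out.println("tags seen: " + countOpenAngleBrackets(decoded));
    }
}
```

If a check like this shows the tags intact and the extracted text still 
contains markup, the cause is more likely in how the parser output is 
used, or in what else writes to the CAS, than in any encoding step.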


Best regards,
Roman


-- 
Roman Klinger
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Schloss Birlinghoven
D-53754 Sankt Augustin
Tel.: +49-2241-14-2360
Fax.: +49-2241-14-4-2360
email: roman.klinger@scai.fhg.de