Posted to general@xml.apache.org by "Jaquiss, Robert" <RJ...@nfb.org> on 2001/11/16 21:43:37 UTC

Looking for tools/ideas for filtering HTML

Hello:
 
     I have just joined this list, and am also a beginning Java
programmer. I apologize if this is not a suitable question for this
list. I need to write a filter for HTML pages. My goal is to read an
HTML page, throwing away all the HTML code and just keeping a block of
text that occurs near the bottom of the page. The HTML tags are liable
to be unbalanced. There will be a <P> but no </P>. I found a sample
program that used the SAXParser, but the SAXParser doesn't seem to
handle unbalanced tags. Ideas/comments would be appreciated.  Thank you.
 
    Regards
   Robert Jaquiss
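
The behaviour described above is expected: a SAX parser is an XML parser, and a <P> without a </P> is not well-formed XML, so the parser stops with a fatal error instead of recovering. A minimal sketch that reproduces this with the JAXP classes bundled in the JDK follows. The class name SaxUnbalancedDemo and the method isWellFormed are hypothetical names chosen for this illustration, not part of the sample program mentioned in the post.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: shows that SAX rejects markup with unbalanced tags.
final class SaxUnbalancedDemo {

    /** Returns true if the markup parses as XML, false if the parser reports a parse error. */
    static boolean isWellFormed(String markup) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // DefaultHandler rethrows fatal errors, so a missing end tag surfaces as an exception.
            parser.parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            return false; // e.g. an element closed before its <P> child was closed
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<html><body><P>text</P></body></html>")); // true
        System.out.println(isWellFormed("<html><body><P>text</body></html>"));     // false
    }
}
```

So the options are either to repair the HTML before handing it to SAX, or to use a parser written for real-world HTML.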
 

Re: Looking for tools/ideas for filtering HTML

Posted by Davanum Srinivas <di...@yahoo.com>.
Use JTidy - http://sourceforge.net/projects/jtidy/

Thanks,
dims
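
For a quick experiment with no extra jar on the classpath, here is a sketch that uses the lenient HTML parser shipped with the JDK (javax.swing.text.html.parser.ParserDelegator). This is a swapped-in alternative, not JTidy and not something proposed in this thread. The class name HtmlTextFilter and the method extractText are hypothetical, and the assumption that the wanted block is the last piece of text on the page comes only from the "near the bottom of the page" description in the question.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text runs of an HTML page, ignoring the tags.
final class HtmlTextFilter {

    /** Returns the non-empty text runs in document order. */
    static List<String> extractText(Reader html) {
        final List<String> blocks = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags.
                String text = new String(data).trim();
                if (!text.isEmpty()) {
                    blocks.add(text);
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(html, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Unbalanced input: two <P> tags, no </P>.
        String page = "<html><body><P>Header<P>Keep this text</body></html>";
        List<String> blocks = extractText(new StringReader(page));
        // Assumption: the block to keep is the last text run on the page.
        System.out.println(blocks.get(blocks.size() - 1));
    }
}
```

To pick out a block other than the last one, the same callback could also override handleStartTag and only start collecting after a particular tag has been seen.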

=====
Davanum Srinivas - http://jguru.com/dims/

---------------------------------------------------------------------
In case of troubles, e-mail:     webmaster@xml.apache.org
To unsubscribe, e-mail:          general-unsubscribe@xml.apache.org
For additional commands, e-mail: general-help@xml.apache.org


RE: Looking for tools/ideas for filtering HTML

Posted by ma...@corrosive.co.uk.
and this:

http://www.scrml.org


>You can take a look at some projects like:
>* JavaCC HTML Parser (http://www.quiotix.com/downloads/html-parser/)
>* HEX - The HTML Enabled XML Parser
>(http://www-uk.hpl.hp.com/people/sth/java/hex.html)
>
>Rgds,
>Neeme


-- 

------------------------------

Max Guglielmino
Corrosive
http://www.corrosive.co.uk



RE: Looking for tools/ideas for filtering HTML

Posted by Neeme Praks <ne...@apache.org>.
You can take a look at some projects like:
* JavaCC HTML Parser (http://www.quiotix.com/downloads/html-parser/)
* HEX - The HTML Enabled XML Parser
(http://www-uk.hpl.hp.com/people/sth/java/hex.html)

Rgds,
Neeme
