Posted to java-user@lucene.apache.org by Jingkang Zhang <zj...@yahoo.com.cn> on 2005/02/01 10:14:44 UTC

which HTML parser is better?

Three HTML parsers (the Lucene web application demo,
CyberNeko HTML Parser, and JTidy) are mentioned in
Lucene FAQ 1.3.27. Which is the best? Can it filter
the tags that are auto-created by the MS Word
'Save As HTML' function?


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: which HTML parser is better?

Posted by sergiu gordea <gs...@ifit.uni-klu.ac.at>.

Jingkang Zhang wrote:

>Three HTML parsers(Lucene web application
>demo,CyberNeko HTML Parser,JTidy) are mentioned in
>Lucene FAQ
>1.3.27.Which is the best?Can it filter tags that are
>auto-created by MS-word 'Save As HTML files' function?
>  
>

Maybe you can try this library:

http://htmlparser.sourceforge.net/

I use the following code to get the text from HTML files.
It has not been tested intensively, but it works.

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.Translate;

// 'source' is the HTML java.io.File and 'writer' an open java.io.Writer,
// both declared elsewhere.
Parser parser = new Parser(source.getAbsolutePath());
NodeIterator iter = parser.elements();
while (iter.hasMoreNodes()) {
    Node element = iter.nextNode();
    // Drop the markup, then decode entities such as &amp;
    String text = Translate.decode(element.toPlainTextString());
    // Utils.notEmptyString is not part of htmlparser; presumably a local
    // helper that rejects null or empty strings.
    if (Utils.notEmptyString(text)) {
        writer.write(text);
    }
}

Sergiu




Re: which HTML parser is better? - Thread closed

Posted by Karl Koch <Th...@gmx.net>.

Thank you, I will do that.



Re: which HTML parser is better?

Posted by sergiu gordea <gs...@ifit.uni-klu.ac.at>.

Karl Koch wrote:

>I appologise in advance, if some of my writing here has been said before.
>The last three answers to my question have been suggesting pattern matching
>solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing
>is something I cannot use since I work with Java 1.1 on a PDA.
>  
>
I see.

In this case you can read your HTML file line by line and then write 
something like this:

String line;
int startPos, endPos;
StringBuffer text = new StringBuffer();
while ((line = reader.readLine()) != null) {
    // end of the first tag on the line
    startPos = line.indexOf(">");
    // start of the next tag after it
    endPos = line.indexOf("<", startPos + 1);
    if (startPos >= 0 && endPos > startPos)
        // copy only the text between the two tags
        text.append(line.substring(startPos + 1, endPos));
}

This is just sample code; it should work if each line of the HTML file 
holds at most one element of the form <tag>text</tag>.
It can be a starting point for you.

  Hope it helps,

 Best,

 Sergiu

>I am wondering if somebody knows a piece of simple sourcecode with low
>requirement which is running under this tense specification.
>
>Thank you all,
>Karl

Re: which HTML parser is better?

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

Karl,

Two things; try experimenting with both:

1) I would try to write a lexical scanner that strips HTML tags, much 
like the regular expression does. Java lexical scanner generators produce 
plain Java classes that seldom use any advanced API, so they should 
work on Java 1.1. They are simple state machines with states encoded as 
integers -- this should work like a charm, and be fast and small.

2) Write a parser yourself. Once you have a regular expression, it isn't 
that difficult to do... :)
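
A hand-written version of the kind of state machine described in point 1
might look like the sketch below. It is an illustration only, not the
output of any scanner generator: the class name TagStripper and the method
strip() are made up for this example. It sticks to String and StringBuffer
so that it should not need anything newer than Java 1.1, and it tracks
quoted attribute values so that a '>' or a nested quote inside an
attribute does not end the tag early.

```java
// Minimal tag stripper using only String and StringBuffer
// (no regex, no Swing, no collections).
// TagStripper and strip() are illustrative names, not a library API.
class TagStripper {
    // Scanner states, encoded as plain integers.
    private static final int TEXT = 0;    // outside any tag
    private static final int TAG = 1;     // between '<' and '>'
    private static final int DQUOTE = 2;  // inside "..." within a tag
    private static final int SQUOTE = 3;  // inside '...' within a tag

    static String strip(String html) {
        StringBuffer out = new StringBuffer();
        int state = TEXT;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (state) {
            case TEXT:
                // Copy characters until a tag opens.
                if (c == '<') state = TAG; else out.append(c);
                break;
            case TAG:
                // Skip everything in the tag; watch for quotes and the end.
                if (c == '>') state = TEXT;
                else if (c == '"') state = DQUOTE;
                else if (c == '\'') state = SQUOTE;
                break;
            case DQUOTE:
                // Only a matching double quote closes the value.
                if (c == '"') state = TAG;
                break;
            case SQUOTE:
                // Only a matching single quote closes the value.
                if (c == '\'') state = TAG;
                break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Hello world
        System.out.println(
            strip("<p onclick=\"alert('hi')\">Hello <b>world</b></p>"));
    }
}
```

The sketch does not decode entities, and it does not treat comments or
script and style content specially, so real pages would need more states.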

D.

Karl Koch wrote:
> I appologise in advance, if some of my writing here has been said before.
> The last three answers to my question have been suggesting pattern matching
> solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing
> is something I cannot use since I work with Java 1.1 on a PDA.
> 
> I am wondering if somebody knows a piece of simple sourcecode with low
> requirement which is running under this tense specification.
> 
> Thank you all,
> Karl

Re: which HTML parser is better?

Posted by Karl Koch <Th...@gmx.net>.

I apologise in advance if some of what I write here has been said before.
The last three answers to my question have suggested pattern matching
solutions and Swing. Pattern matching (java.util.regex) was introduced in
Java 1.4, and Swing is something I cannot use since I work with Java 1.1
on a PDA.

I am wondering if somebody knows a simple piece of source code with low
requirements that runs under these tight constraints.

Thank you all,
Karl

> No one has yet mentioned using ParserDelegator and ParserCallback that 
> are part of HTMLEditorKit in Swing.  I have been successfully using 
> these classes to parse out the text of an HTML file.  You just need to 
> extend HTMLEditorKit.ParserCallback and override the various methods 
> that are called when different tags are encountered.
> 
> 
> On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
> 
> > Three HTML parsers(Lucene web application
> > demo,CyberNeko HTML Parser,JTidy) are mentioned in
> > Lucene FAQ
> > 1.3.27.Which is the best?Can it filter tags that are
> > auto-created by MS-word 'Save As HTML files' function?

Re: which HTML parser is better?

Posted by Ian Soboroff <ia...@nist.gov>.

Oops.  It's in the Google cache and also the Internet Archive Wayback
Machine.  I'll drop the original author a note to let him know that
his links are stale.

http://web.archive.org/web/20040208014740/http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/

Ian

"Karl Koch" <Th...@gmx.net> writes:

> The link does not work.
>
>> 
>> One which we've been using can be found at:
>> http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/
>> 
>> We absolutely need to be able to recover gracefully from malformed
>> HTML and/or SGML.  Most of the nicer SAX/DOM/TLA parsers out there
>> failed this criterion when we started our effort.  The above one is
>> kind of SAX-y but doesn't fall over at the sight of a real web page
>> ;-)




Re: which HTML parser is better?

Posted by Karl Koch <Th...@gmx.net>.

The link does not work.

> 
> One which we've been using can be found at:
> http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/
> 
> We absolutely need to be able to recover gracefully from malformed
> HTML and/or SGML.  Most of the nicer SAX/DOM/TLA parsers out there
> failed this criterion when we started our effort.  The above one is
> kind of SAX-y but doesn't fall over at the sight of a real web page
> ;-)

Re: which HTML parser is better?

Posted by Ian Soboroff <ia...@nist.gov>.

One which we've been using can be found at:
http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/

We absolutely need to be able to recover gracefully from malformed
HTML and/or SGML.  Most of the nicer SAX/DOM/TLA parsers out there
failed this criterion when we started our effort.  The above one is
kind of SAX-y but doesn't fall over at the sight of a real web page
;-)

Ian



Re: which HTML parser is better?

Posted by aurora <au...@gmail.com>.

For all the parser suggestions, I think there is one important attribute  
to consider. Some parsers return data only provided that the input HTML is  
sensible. Others are designed to be as flexible and tolerant as they can  
be. If the input is clean and controlled, the former class is sufficient.  
Even a regular expression may be sufficient (I think that's what the  
original poster wants). If you are building a web crawler, you need  
something really tolerant.

I once prototyped a nice, fast parser. Later I had to abandon it because  
it failed to parse about 15% of documents (problems handling nested  
quotes like onclick="alert('hi')").
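
One common way a lightweight parser trips over such input is sketched
below. This is an assumption about a typical failure mode, not a
description of the prototype mentioned above; the class QuoteDemo and its
two patterns are made up for the illustration. The naive pattern lets
either quote character end an attribute value, while the quote-aware one
only accepts the same quote that opened it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: two ways of matching a quoted attribute value.
class QuoteDemo {
    // Naive: either quote character may open or close the value.
    static final Pattern NAIVE =
        Pattern.compile("=\\s*[\"']([^\"']*)[\"']");
    // Quote-aware: the value ends only at the quote that opened it.
    static final Pattern AWARE =
        Pattern.compile("=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    // Returns the first attribute value as the naive pattern sees it.
    static String naiveValue(String tag) {
        Matcher m = NAIVE.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    // Returns the first attribute value as the quote-aware pattern sees it.
    static String awareValue(String tag) {
        Matcher m = AWARE.matcher(tag);
        if (!m.find()) return null;
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    public static void main(String[] args) {
        String tag = "<a onclick=\"alert('hi')\">";
        System.out.println(naiveValue(tag)); // prints: alert(
        System.out.println(awareValue(tag)); // prints: alert('hi')
    }
}
```

The naive version cuts the value off at the inner single quote, which is
enough to derail whatever parsing comes after it.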

> No one has yet mentioned using ParserDelegator and ParserCallback that  
> are part of HTMLEditorKit in Swing.  I have been successfully using  
> these classes to parse out the text of an HTML file.  You just need to  
> extend HTMLEditorKit.ParserCallback and override the various methods  
> that are called when different tags are encountered.
>
>
> On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
>
>> Three HTML parsers(Lucene web application
>> demo,CyberNeko HTML Parser,JTidy) are mentioned in
>> Lucene FAQ
>> 1.3.27.Which is the best?Can it filter tags that are
>> auto-created by MS-word 'Save As HTML files' function?


Re: which HTML parser is better?

Posted by Bill Tschumy <bi...@otherwise.com>.

No one has yet mentioned using ParserDelegator and ParserCallback that 
are part of HTMLEditorKit in Swing.  I have been successfully using 
these classes to parse out the text of an HTML file.  You just need to 
extend HTMLEditorKit.ParserCallback and override the various methods 
that are called when different tags are encountered.
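
As a rough sketch of that approach (the class name TextCollector and the
helper extract() are made up for this example and are not taken from any
existing code), a callback that overrides only handleText can collect the
text of a document like this:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Collects the character data reported by Swing's HTML parser.
// TextCollector and extract() are illustrative names.
class TextCollector extends HTMLEditorKit.ParserCallback {
    private final StringBuffer text = new StringBuffer();

    // Called by the parser for each run of text between tags.
    public void handleText(char[] data, int pos) {
        text.append(data).append(' ');
    }

    // Parses the HTML from the reader and returns the collected text.
    static String extract(Reader html) throws IOException {
        TextCollector collector = new TextCollector();
        // true = ignore any charset declared inside the document
        new ParserDelegator().parse(html, collector, true);
        return collector.text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extract(new StringReader(
            "<html><body><p>Hello <b>world</b></p></body></html>")));
    }
}
```

Other methods of ParserCallback, such as handleStartTag and handleEndTag,
can be overridden in the same way if the tags themselves matter.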

On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:

> Three HTML parsers(Lucene web application
> demo,CyberNeko HTML Parser,JTidy) are mentioned in
> Lucene FAQ
> 1.3.27.Which is the best?Can it filter tags that are
> auto-created by MS-word 'Save As HTML files' function?
-- 
Bill Tschumy
Otherwise -- Austin, TX
http://www.otherwise.com


Re: which HTML parser is better?

Posted by Michael Giles <mg...@furl.net>.

When I tested parsers a year or so ago for intensive use in Furl, the
best (tolerant of bad HTML) and fastest (tested on a 1.5M HTML page)
parser by far was TagSoup ( http://www.tagsoup.info ). It is actively
maintained and improved and I have never had any problems with it.

-Mike

Jingkang Zhang wrote:

>Three HTML parsers(Lucene web application
>demo,CyberNeko HTML Parser,JTidy) are mentioned in
>Lucene FAQ
>1.3.27.Which is the best?Can it filter tags that are
>auto-created by MS-word 'Save As HTML files' function?