
Posted to j-users@xerces.apache.org by Takumi Fujiwara <tr...@yahoo.com> on 2004/03/05 19:53:41 UTC

Neko HTML Parser Question

Does/Will the NekoHTML parser work with any JAXP parser?

e.g. Piccolo at http://piccolo.sourceforge.net?

I think Piccolo is faster than Xerces. So I would like
to take advantage of Piccolo for parsing/correcting
HTML.

Thank you.




---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org

Re: Neko HTML Parser Question

Posted by Elliotte Rusty Harold <el...@metalab.unc.edu>.

At 9:49 PM -0800 3/7/04, Andy Clark wrote:

>The perceived slowness is because Xerces is a conformant
>XML parser. "Faster" XML parsers usually gain from not
>implementing validation or only supporting a limited
>number of character encodings. So keep this in mind when
>evaluating parsers and pick the parser for your
>application appropriately.

The conformance does slow down Xerces relative to some other less 
conformant parsers like Piccolo, but I don't think it's for the 
reasons you cite. Extra encoding support should have no effect on 
speed. After all, the encodings you aren't using don't cost you 
anything. Likewise, validation should be free if you aren't using it. 
Dropping these would save size, but I doubt it really saves time.

The cost of conformance is in fully implementing all the weird little 
niches of XML, plus all the edge cases of various APIs. Piccolo 
mishandles a lot of well-formed and malformed XML that's just a 
little bit off the beaten track, but it does at least try to handle 
XML. A lot of other parsers that claim speed advantages only get 
this by assuming XML is less complicated than it really is, and 
ignoring the internal DTD subset, assuming mixed content doesn't 
exist, or making other very questionable choices.
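
One way to probe for the shortcuts described above is to feed a parser a
tiny document that relies on both the internal DTD subset and mixed
content. The following is only a minimal sketch using the standard JAXP
DOM API; the class name ConformanceProbe, the helper textOf and the sample
document are hypothetical and made up for illustration. It exercises
whichever parser JAXP selects and says nothing about any specific product.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ConformanceProbe {

    // Hypothetical sample: an entity declared in the internal DTD subset,
    // used inside an element with mixed content (text plus a child element).
    static final String SAMPLE =
        "<!DOCTYPE doc [<!ENTITY greeting \"hello\">]>"
        + "<doc>&greeting; <b>world</b>!</doc>";

    // Parses the document with whatever parser JAXP selects and returns
    // the concatenated text content of the root element.
    static String textOf(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(textOf(SAMPLE));
    }
}
```

A parser that reads the internal subset and keeps mixed content should
report the text "hello world!". One that skips the subset would typically
fail on the undeclared entity or drop it.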
-- 

   Elliotte Rusty Harold
   elharo@metalab.unc.edu
   Effective XML (Addison-Wesley, 2003)
   http://www.cafeconleche.org/books/effectivexml
   http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA


Re: Neko HTML Parser Question

Posted by Andy Clark <an...@apache.org>.

Takumi Fujiwara wrote:
> Does/Will the NekoHTML parser work with any JAXP parser?
>
> e.g. Piccolo at http://piccolo.sourceforge.net?
> 
> I think Piccolo is faster than Xerces. So I would like
> to take advantage of Piccolo for parsing/correcting
> HTML.

NekoHTML requires the Xerces Native Interface (XNI), not
Xerces (per se). If you instantiate a NekoHTML DOM or SAX
parser, you will get a subclass of the Xerces parser but
that does *not* mean that you are using Xerces. NekoHTML
swaps the parsing pipeline in the DOM/SAX parser with its
own. So the scanning and tag-balancing operations are
strictly NekoHTML. Therefore, if NekoHTML is slow, then
that's my fault, not Xerces. :)
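
To make the point above concrete: application code talks to NekoHTML
through the ordinary SAX (or DOM) interfaces, so it can be swapped in at
the XMLReader level without touching the handler code. The sketch below is
only an illustration and makes some assumptions: that NekoHTML's SAX parser
class is org.cyberneko.html.parsers.SAXParser, and that the NekoHTML and
Xerces jars may or may not be on the classpath. The class NekoReaderSketch
and its helper methods are hypothetical. When NekoHTML is missing, it falls
back to the default JAXP parser, which only accepts well-formed markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class NekoReaderSketch {

    // Assumed class name of NekoHTML's SAX parser; used only if present.
    static final String NEKO_SAX = "org.cyberneko.html.parsers.SAXParser";

    // Returns a NekoHTML reader when the library is available,
    // otherwise the default JAXP SAX parser's XMLReader.
    static XMLReader newReader() throws Exception {
        try {
            return (XMLReader) Class.forName(NEKO_SAX)
                .getDeclaredConstructor().newInstance();
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        }
    }

    // Collects element names in document order; the handler code is the
    // same regardless of which parser produced the events.
    static List<String> elementNames(XMLReader reader, String markup)
            throws Exception {
        final List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = newReader();
        System.out.println("reader class: " + reader.getClass().getName());
        System.out.println(
            elementNames(reader, "<html><body><p>hi</p></body></html>"));
    }
}
```

Printing the reader's class name shows which implementation was picked up.
The exact element names reported (case, inserted elements) may differ
between NekoHTML and a plain XML parser.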

The perceived slowness is because Xerces is a conformant
XML parser. "Faster" XML parsers usually gain from not
implementing validation or only supporting a limited
number of character encodings. So keep this in mind when
evaluating parsers and pick the parser for your
application appropriately.

-- 
Andy Clark * andyc@apache.org
