Posted to j-dev@xerces.apache.org by Andy Clark <an...@apache.org> on 2002/05/12 16:50:55 UTC

[Announce] NekoHTML 0.6 Available

Well, I've been quite busy lately working on the NekoHTML 
parser for Xerces2 and I'm pleased to announce the latest 
version, NekoHTML 0.6, is available for download at the
following location:

  http://www.apache.org/~andyc/nekohtml/doc/index.html

There are a *lot* of changes and additions in this version.
Here's a list of what's new:

  * Added property to allow custom document filters to be 
    appended to the default NekoHTML parser pipeline; 
  * added convenience filters for serializing HTML documents 
    and removing elements from the document event stream; 
  * added samples to demonstrate the filtering feature; 
  * added experimental functionality to allow applications 
    to dynamically insert content into the HTML document 
    stream; 
  * added a minimal Xerces2 Jar file containing just the 
    files required for using the HTMLConfiguration class 
    directly to alleviate full dependence on Xerces2 
    distribution; 
  * applied patch from Serge Proskuryakov to fix handling 
    of misplaced <title> within <body>; 
  * fixed minor tag balancing bug; and 
  * re-organized and added new documentation.

The coolest features added to this version are the ability
to append custom document filters to the parsing pipeline
by setting a property; and the (currently experimental)
ability to dynamically insert new content into the document
parsing stream.

I have included a variety of simple (but quite useful)
samples of the new filter functionality. One filter is an
HTML serializer which has the ability to change the encoding
of the document as it's being serialized -- this includes
changing the META[@http-equiv='content-type']/@content
attribute on the way out. 
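
As a rough illustration of the META rewriting step described above, here
is a standalone sketch that replaces the charset parameter in a
content-type value. It uses only java.util.regex; the class name
ContentTypeRewriter and the method withCharset are hypothetical, and this
is not the code of the NekoHTML serializer filter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of NekoHTML.
class ContentTypeRewriter {
    // Matches the charset parameter of a content-type value, ignoring case.
    private static final Pattern CHARSET =
        Pattern.compile("(charset\\s*=\\s*)[^;\\s\"']+", Pattern.CASE_INSENSITIVE);

    // Returns the value with its charset replaced, or appended if none is present.
    static String withCharset(String content, String newCharset) {
        Matcher m = CHARSET.matcher(content);
        if (m.find()) {
            return m.replaceFirst("$1" + Matcher.quoteReplacement(newCharset));
        }
        return content + "; charset=" + newCharset;
    }

    public static void main(String[] args) {
        System.out.println(withCharset("text/html; charset=ISO-8859-1", "UTF-8"));
    }
}
```

A real serializer would also have to encode the output bytes in the new
charset; the sketch only covers the attribute value.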

Another filter strips elements (and attrs) from the document 
stream. This one is useful for stripping out everything but 
rich-text elements, for example. I'm thinking about writing
a related filter that converts the remaining rich-text
elements to text which would be a good way of producing
vanilla text documents that retain the "richness".
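
To make the stripping idea concrete, here is a standalone model of it: 
accepted elements keep only their listed attributes, and tags of all 
other elements are dropped while their text passes through. The class 
ElementStripSketch is hypothetical and works on plain strings rather 
than XNI events; treating a null attribute list as "keep no attributes" 
is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Standalone model of element stripping; not the NekoHTML ElementRemover.
class ElementStripSketch {
    // Accepted element name -> attribute names that element may keep.
    private final Map<String, Set<String>> accepted = new HashMap<>();

    // Registers an element to keep; a null list keeps no attributes (assumed).
    void acceptElement(String name, String[] attrs) {
        Set<String> keep = new HashSet<>();
        if (attrs != null) {
            for (String a : attrs) keep.add(a.toLowerCase(Locale.ROOT));
        }
        accepted.put(name.toLowerCase(Locale.ROOT), keep);
    }

    // Returns the start tag to emit, or "" when the element is not accepted.
    String startElement(String name, Map<String, String> attrs) {
        Set<String> keep = accepted.get(name.toLowerCase(Locale.ROOT));
        if (keep == null) return "";
        StringBuilder tag = new StringBuilder("<").append(name);
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            if (keep.contains(e.getKey().toLowerCase(Locale.ROOT))) {
                tag.append(' ').append(e.getKey())
                   .append("=\"").append(e.getValue()).append('"');
            }
        }
        return tag.append('>').toString();
    }

    // Returns the end tag to emit, or "" when the element is not accepted.
    String endElement(String name) {
        return accepted.containsKey(name.toLowerCase(Locale.ROOT))
            ? "</" + name + ">" : "";
    }

    // Replays the events for <div><a href="x" class="y">hi</a></div>.
    static String demo() {
        ElementStripSketch s = new ElementStripSketch();
        s.acceptElement("b", null);
        s.acceptElement("a", new String[] { "href" });
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("href", "x");
        attrs.put("class", "y");
        return s.startElement("div", new LinkedHashMap<>())
             + s.startElement("a", attrs)
             + "hi"
             + s.endElement("a")
             + s.endElement("div");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints <a href="x">hi</a>
    }
}
```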

I have also included an identity transform which basically
filters out all of the events synthesized by the tag
balancer. Why would you want to do this? Well, you might
want to receive all of the warnings/errors reported by
the tag balancer without wanting the elements that were
generated to make the document well-formed.

Adding custom filters is incredibly easy. Simply make an
array of objects that implement the XMLDocumentFilter 
interface from XNI and set the appropriate property on
the parser. For example:

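  // Imports below are assumed package names for NekoHTML and XNI.
  import org.apache.xerces.xni.parser.XMLDocumentFilter;
  import org.cyberneko.html.filters.ElementRemover;
  import org.cyberneko.html.filters.Writer;
  import org.cyberneko.html.parsers.SAXParser;

  // Accept only basic rich-text elements; <a> keeps its href attribute.
  ElementRemover remover = new ElementRemover();
  remover.acceptElement("b", null);
  remover.acceptElement("i", null);
  remover.acceptElement("u", null);
  remover.acceptElement("a", new String[] { "href" });

  // Filters run in array order: strip elements first, then serialize.
  XMLDocumentFilter[] filters = { remover, new Writer() };

  // Append the filters to the default parser pipeline via the property.
  SAXParser parser = new SAXParser();
  parser.setProperty("http://cyberneko.org/html/properties/filters",
                     filters);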

But this is all covered in the docs which I have
expanded and improved. I've separated the existing docs
into multiple pages and added a bunch of information 
about the filters, etc. And now it's finally all on my
public website so you don't have to download the package
to peruse the information.

The other big feature (which took me longer to implement
today than I thought) is the ability to insert content
into the document parsing stream. I've labeled it as
"experimental" because I'm not entirely convinced yet
that it's a good way to do it -- I'm referring to the
public API here.

There is now a method on the HTMLConfiguration called
"pushInputSource" which allows you to push a new input
source onto the stack of readers. This is the same thing
we do in the Xerces2 implementation (albeit in a more
roundabout way) but it has the net effect of changing where
the parser is scanning. When the end of that stream is
reached, the parser pops it off and continues where it
left off. Pretty cool.
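
The push/pop behaviour described above can be modelled with a small 
standalone sketch built only on java.io. The class ReaderStack and the 
'#' marker convention are hypothetical; this shows the stack-of-readers 
idea, not the HTMLConfiguration or Xerces2 implementation.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of a stack of readers; not NekoHTML code.
class ReaderStack {
    private final Deque<Reader> readers = new ArrayDeque<>();

    // Pushes a new source; the next read() scans from it.
    void pushInputSource(Reader reader) {
        readers.push(reader);
    }

    // Reads one character, popping exhausted readers; -1 when all are done.
    int read() throws IOException {
        while (!readers.isEmpty()) {
            int c = readers.peek().read();
            if (c != -1) {
                return c;
            }
            readers.pop().close();
        }
        return -1;
    }

    // Scans a document and pushes extra content when a '#' marker is seen.
    static String demo() {
        ReaderStack stack = new ReaderStack();
        stack.pushInputSource(new StringReader("<p>a#b</p>"));
        StringBuilder out = new StringBuilder();
        try {
            int c;
            while ((c = stack.read()) != -1) {
                if (c == '#') {
                    stack.pushInputSource(new StringReader("<b>inserted</b>"));
                } else {
                    out.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints <p>a<b>inserted</b>b</p>
    }
}
```

Once the pushed reader is exhausted it is popped, and scanning resumes 
in the original document right after the marker.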

There is a new sample called Script in the src/sample/
directory that shows how it is used. Again, there's more
information in the new documentation.

Like I said, it's experimental because I may think of
a "cleaner" way of allowing applications to do this.
But then again, if it works, why fix it? So I'll just
have to see how it goes.

And lastly, I wanted to mention that this distribution
now includes a minimal Xerces Jar file for convenience.
This Jar just contains the XNI framework and the Xerces2
utility classes that are used by the NekoHTML impl. So,
if you are using the HTMLConfiguration class directly
(and *not* using the DOMParser or SAXParser which have
more dependencies), then you can just use the NekoHTML
Jar file and the minimal Xerces Jar file. This greatly
reduces the size of the required files. 

I see a huge savings because I write directly to XNI.
Compare for yourself:

    42k nekohtml.jar
    35k lib/xercesMinimal.jar

   131k lib/xmlParserAPIs.jar
  1760k lib/xercesImpl.jar

Okay, that's all for now. Enjoy!

-- 
Andy Clark * andyc@apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org

Re: [Announce] NekoHTML 0.6 Available

Posted by Nick Kew <ni...@webthing.com>.

On Mon, 13 May 2002, Matt Sergeant wrote:

> On Monday 13 May 2002 7:59 am, Nick Kew wrote:
> > Perhaps I should add that my interest in this is with regard to a
> > validating (and rather more) processor that will deal with both HTML and
> > XML using a common UI.
>
> libxml2?

I'm already using that (my Accessibility Proxy uses the HTML parser
from libxml2).  The advantage of Xerces is that its validation
is quite a lot more complete than that of libxml2.

The other library I'm using is of course OpenSP.  The problem we are
looking to solve with Xerces is that OpenSP doesn't deal with XML
namespaces or schema.

-- 
Nick Kew

Available for contract work - Programming, Unix, Networking, Markup, etc.

---------------------------------------------------------------------
In case of troubles, e-mail:     webmaster@xml.apache.org
To unsubscribe, e-mail:          general-unsubscribe@xml.apache.org
For additional commands, e-mail: general-help@xml.apache.org

Re: [Announce] NekoHTML 0.6 Available

Posted by Matt Sergeant <ma...@sergeant.org>.

On Monday 13 May 2002 7:59 am, Nick Kew wrote:
> Perhaps I should add that my interest in this is with regard to a
> validating (and rather more) processor that will deal with both HTML and
> XML using a common UI.

libxml2?

-- 
<:->get a SMart net</:->

Re: [Announce] NekoHTML 0.6 Available

Posted by Nick Kew <ni...@webthing.com>.

Perhaps I should add that my interest in this is with regard to a
validating (and rather more) processor that will deal with both HTML and
XML using a common UI.

-- 
Nick Kew


Re: [Announce] NekoHTML 0.6 Available

Posted by Nick Kew <ni...@webthing.com>.

On Mon, 13 May 2002, Andy Clark wrote:

> Nick Kew wrote:
> > Has anyone given any thought to how well/easily this would port to
> > run under Xerces-C++?  I might[1] take a look at that myself, so if
> > anyone has comments or experiences with it, please tell me!
>
> Aside from the fact that NekoHTML is written to work well
> within the XNI framework, the parser was written from the
> ground up with little reliance on the Xerces2 implementation.
> So porting to C++ should not be too difficult.

Well, that's a start.  Though porting to C++ and porting to Xerces-C++
are by no means the same thing, given the substantial infrastructure
implied by the latter.

But your reply suggests another faint possibility: that it might be
ported to C++ as mix-and-match.

> But your question begs another: is anyone in the Xerces-C
> community considering re-working the C++ parser so that it
> is also built upon the XNI framework?

I've no idea - I had to google to expand that TLA.  Since [Xerces|Xalan]-C
is amongst the least portable software to appear as open-source in the
past decade or so, the idea of a "native interface" might seem rather OTT.

-- 
Nick Kew

Available for contract work - Programming, Unix, Networking, Markup, etc.

Re: [Announce] NekoHTML 0.6 Available

Posted by Andy Clark <an...@apache.org>.

Nick Kew wrote:
> Has anyone given any thought to how well/easily this would port to
> run under Xerces-C++?  I might[1] take a look at that myself, so if
> anyone has comments or experiences with it, please tell me!

Aside from the fact that NekoHTML is written to work well
within the XNI framework, the parser was written from the
ground up with little reliance on the Xerces2 implementation.
So porting to C++ should not be too difficult.

But your question begs another: is anyone in the Xerces-C
community considering re-working the C++ parser so that it
is also built upon the XNI framework?

-- 
Andy Clark * andyc@apache.org

Re: [Announce] NekoHTML 0.6 Available

Posted by Nick Kew <ni...@webthing.com>.

Has anyone given any thought to how well/easily this would port to
run under Xerces-C++?  I might[1] take a look at that myself, so if
anyone has comments or experiences with it, please tell me!


[1] No promises!

-- 
Nick Kew

Available for contract work - Programming, Unix, Networking, Markup, etc.


