Posted to j-dev@xerces.apache.org by Andy Clark <an...@apache.org> on 2002/04/02 12:26:57 UTC

HTML Parser Update Available

Since the topic of parsing HTML has come up again recently, I 
posted a new version of the NekoHTML parser for Xerces2 to my
website. This version fixes a few bugs and adds some convenient 
DOM and SAX parser classes so it's a little easier to use 
directly. 

If anyone's interested, it's located at the following URL:

  http://www.apache.org/~andyc/

-- 
Andy Clark * andyc@apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org

Re: HTML Parser Update Available

Posted by Matt Sergeant <ma...@sergeant.org>.

On Wednesday 10 April 2002 5:14 pm, Harald Hett wrote:
> Recently I searched the web for an HTML parser that is capable of parsing
> dirty HTML of any kind. But nothing I found really convinced me.
> Only JTidy seemed to fit. But none of those solutions produces a
> DOM tree in the end that can easily be modified using the DOM API or
> an XSLT stylesheet. That is a nice and interesting feature of CyberNeko
> and makes it interesting for a lot of programmers.

Libxml2 parses dirty HTML and produces a DOM tree suitable for passing to 
XSLT, or for turning into XHTML, or re-rendering to HTML.

-- 
<:->get a SMart net</:->

---------------------------------------------------------------------
In case of troubles, e-mail:     webmaster@xml.apache.org
To unsubscribe, e-mail:          general-unsubscribe@xml.apache.org
For additional commands, e-mail: general-help@xml.apache.org

Re: HTML Parser Update Available

Posted by Nicola Ken Barozzi <ni...@apache.org>.

From: "Andy Clark" <an...@apache.org>

> There hasn't been an overwhelming demand for it. Although, a
> few people responded and said it would generally be a "good"
> thing to include with Xerces. Certainly if there is a need
> for it, I wouldn't mind rolling it into the codebase -- it's
> actually quite small so it wouldn't add a lot to the source
> or Jar file(s).
>
> What do people think?

It would be great for us on the Cocoon project, since we could do away with
JTidy, which is slow and resistant to change because it depends on the C
version.

+1

If you want I can cc this to loads of developers to have them ask you to put
it in Xerces... ;-)

--
Nicola Ken Barozzi                   nicolaken@apache.org
            - verba volant, scripta manent -
   (discussions get forgotten, just code remains)
---------------------------------------------------------------------



Re: HTML Parser Update Available

Posted by Andy Clark <an...@apache.org>.

Harald Hett wrote:
> Again, I think the value should not lead to misunderstandings. Maybe
> "nochange" is not correct for the kind of case manipulation the NekoHTML
> parser must do, but "balance" is not either, I think. Hm, "match" fits
> much better. Hm, what about "trim"?

I don't like "trim" because there's no trimming happening... I
think I'll go with "match". Besides, this is just a documentation
issue. In the code, I only look for "upper" and "lower" -- all
other values are treated as "match". So the following would be
just fine as well:

  parser.setProperty("http://cyberneko.org/html/properties/names/elems",
                     "only-match-the-end-tag-to-the-start-tag-you-dolt");

:)
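
As a rough illustration of the rule described above (only "upper" and "lower" are recognized, everything else behaves like "match"), here is a minimal sketch. The class and method names are hypothetical and this is not NekoHTML's actual code; it also assumes the property value is compared as an exact, case-sensitive string.

```java
import java.util.Locale;

// Hypothetical helper, not part of NekoHTML: one way the "upper" / "lower" /
// anything-else-means-"match" rule could be applied to a start tag name.
final class NameCaseSketch {

    // Returns the name as it would be reported for the given property value.
    static String apply(String name, String propertyValue) {
        if ("upper".equals(propertyValue)) {
            return name.toUpperCase(Locale.ENGLISH);
        }
        if ("lower".equals(propertyValue)) {
            return name.toLowerCase(Locale.ENGLISH);
        }
        // Any other value, including "match" or a joke string, keeps the
        // name exactly as it was written in the source document.
        return name;
    }

    public static void main(String[] args) {
        System.out.println(apply("Body", "upper"));
        System.out.println(apply("Body", "lower"));
        System.out.println(apply("Body", "only-match-the-end-tag-to-the-start-tag-you-dolt"));
    }
}
```

Locale.ENGLISH is passed explicitly so that the result does not depend on the default locale of the machine running the parser.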

> > But I'm getting tired so I'll probably do that tomorrow.
> Well, the weekend starts now!

I was writing the last message at about 9:30pm on Friday here
in Tokyo. I was feeling a little anxious to get out of the
house for a while so I just made sure that the casing code
worked and then went to the pub. :) Today I'll work on adding
the infoset augmentations, updating the docs, and pushing a
new version out the door. [Disclaimer: but don't kill me if
it doesn't get out there until tomorrow.]

-- 
Andy Clark * andyc@apache.org


Re: HTML Parser Update Available

Posted by Harald Hett <h....@gis-systemhaus.de>.

Andy Clark wrote:
> But I am thinking of changing it to "balance" or "match". The
> reason why is because even if the NekoHTML parser accepts your
> element names as-is, it still MUST modify the end tag name to
> match the start tag name. This is required to produce a well-
> formed XML document.
Again, I think the value should not lead to misunderstandings. Maybe
"nochange" is not correct for the kind of case manipulation the NekoHTML
parser must do, but "balance" is not either, I think. Hm, "match" fits
much better. Hm, what about "trim"? 

> I put in the name-case properties already as well as the
> property to set the default encoding. Next will be
> the infoset augmentations to indicate "synthesized" events.
> After that, it's just a matter of updating the docs and
> releasing the new version.
> 
> But I'm getting tired so I'll probably do that tomorrow.
Well, the weekend starts now! 

Bye

-- 
Harald Hett <h....@gis-systemhaus.de>
Gesellschaft für integrierte Systemplanung


Re: HTML Parser Update Available

Posted by Andy Clark <an...@apache.org>.

Harald Hett wrote:
> The value "nochange" sounds good. I think it would be better to use this
> value instead of "default" or "no". It leaves no doubt about what the
> parser does, whereas the value "default" could be misunderstood as "let
> the parser do whatever it defaults to", which could also mean "convert
> to a particular case".

Very good point! 

But I am thinking of changing it to "balance" or "match". The 
reason why is because even if the NekoHTML parser accepts your 
element names as-is, it still MUST modify the end tag name to 
match the start tag name. This is required to produce a well-
formed XML document.

For example:

  <B>bold</b>

This should come out as (using NSGMLS format):

  (B
  "bold
  )B

and NOT the following:

  (B
  "bold
  )b
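
The behavior in this example can be sketched as follows. This is only an illustration under simplifying assumptions, not the NekoHTML tag balancer: the class name is made up, end tags are compared case-insensitively against the innermost open element only, and unmatched end tags are silently dropped. The output uses the same notation as the example above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: report each end tag under the name of the start tag
// it closes, so the emitted events stay well-formed.
final class EndTagMatchSketch {

    private final Deque<String> open = new ArrayDeque<>();
    private final StringBuilder out = new StringBuilder();

    void startElement(String name) {
        open.push(name);
        out.append('(').append(name).append('\n');
    }

    void characters(String text) {
        out.append('"').append(text).append('\n');
    }

    void endElement(String name) {
        String top = open.peek();
        if (top != null && top.equalsIgnoreCase(name)) {
            open.pop();
            // Emit the start tag's spelling, not the end tag's.
            out.append(')').append(top).append('\n');
        }
        // End tags that do not match the innermost open element are
        // ignored here; a real balancer has to do considerably more.
    }

    String output() {
        return out.toString();
    }

    public static void main(String[] args) {
        EndTagMatchSketch sketch = new EndTagMatchSketch();
        sketch.startElement("B");
        sketch.characters("bold");
        sketch.endElement("b");
        System.out.print(sketch.output());
    }
}
```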

What do you think about the name change? Which one do you
prefer? Or do you have another suggestion?

> Thank you for the information. I am curious to see what comes out.
> 
> Happy Coding!

I put in the name-case properties already as well as the
property to set the default encoding. Next will be
the infoset augmentations to indicate "synthesized" events.
After that, it's just a matter of updating the docs and
releasing the new version.

But I'm getting tired so I'll probably do that tomorrow.

-- 
Andy Clark * andyc@apache.org


Re: HTML Parser Update Available

Posted by Harald Hett <h....@gis-systemhaus.de>.

Andy Clark wrote:
> This weekend I'll be working on adding some minor features to
> NekoHTML. During that time, I'll add some properties to allow
> the application to control how NekoHTML handles element and
> attribute names from the source document. I'm currently
> thinking of the following properties:
> 
>   "http://cyberneko.org/html/properties/names/elems"
>   "http://cyberneko.org/html/properties/names/attrs"
> 
> each with the following allowed values:
> 
>   { "upper", "lower", "default" }
> 
> Since I changed the property names, your request to change
> the "default" value to "no" doesn't apply anymore. So I'm
> still using "default". Does this make more sense or should
> it be changed to something else entirely, like "nochange" or
> "specified" or ...?
> 
I am looking forward to the result. 
The value "nochange" sounds good. I think it would be better to use this
value instead of "default" or "no". It leaves no doubt about what the
parser does, whereas the value "default" could be misunderstood as "let
the parser do whatever it defaults to", which could also mean "convert
to a particular case".

> directive specifying the encoding. This is probably because
> they falsely assume that only people with Japanese systems
> are visiting their web site. (It's not just English-speaking
> programmers that ignore globalization, people! ;)
Unfortunately, there is a lot more that is falsely assumed by the
people who create web sites ;-)

> We are having a discussion in a separate thread regarding this
> very topic. Depending on the result of that discussion, NekoHTML
> will either be rolled into Xerces OR become a separate project
> of its own. In the latter case, it's not clear whether separate
> projects are included in the Xerces codebase (but kept separate)
> or hosted elsewhere.
> 
> I have no problem with NekoHTML remaining separate but it would
> be nice to have links to related projects from the Xerces page,
> as you suggest.
Thank you for the information. I am curious to see what comes out.

Happy Coding! 

-- 
Harald Hett <h....@gis-systemhaus.de>
Gesellschaft für integrierte Systemplanung


Re: HTML Parser Update Available

Posted by Andy Clark <an...@apache.org>.

Harald Hett wrote:
> > 2) A property, for example:
> >
> >   "http://cyberneko.org/html/names/modify"  { "upper", "lower", "default" }
> >
> > [These are just examples. I might want to modify the names.]
> >
> A property would be great, but with "no" instead of "default".

This weekend I'll be working on adding some minor features to
NekoHTML. During that time, I'll add some properties to allow
the application to control how NekoHTML handles element and
attribute names from the source document. I'm currently
thinking of the following properties:

  "http://cyberneko.org/html/properties/names/elems"
  "http://cyberneko.org/html/properties/names/attrs"

each with the following allowed values:

  { "upper", "lower", "default" }

Since I changed the property names, your request to change 
the "default" value to "no" doesn't apply anymore. So I'm
still using "default". Does this make more sense or should
it be changed to something else entirely, like "nochange" or
"specified" or ...?

In addition, I'll be adding code to allow the application
to set which encoding to use by default. Right now the default
is Cp1252, which is the standard Windows code page (on English
machines). But I'm running into the situation where I'm
parsing Japanese HTML pages that do not have an http-equiv
directive specifying the encoding. This is probably because
they falsely assume that only people with Japanese systems 
are visiting their web site. (It's not just English-speaking
programmers that ignore globalization, people! ;)

In this case, I need to add some kind of intelligence to my
app so the parser uses a different default encoding. For
example, if the domain ends with ".jp" then assume "EUC-JP" 
or "Shift_JIS" encoding. <tangent>It would be awesome if
there was an "AutoDetect" Japanese decoder for Java. But
until then I'll just have to pick one.</tangent> Anyway, 
this would be set through a property as well.
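
A minimal sketch of the application-side heuristic described above might look like this. The class and method names are hypothetical, the choice of "Shift_JIS" over "EUC-JP" is arbitrary, and the fallback is simply the current default mentioned earlier. The result would then have to be handed to the parser through whatever property ends up being defined for this.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical application-side helper: pick a fallback encoding for pages
// that do not declare one, based only on the host name of the URL.
final class DefaultEncodingSketch {

    static String guessDefaultEncoding(String url) {
        // URI.create throws IllegalArgumentException for malformed input.
        String host = URI.create(url).getHost();
        if (host != null && host.toLowerCase(Locale.ENGLISH).endsWith(".jp")) {
            // Arbitrary pick between the two candidates; without an
            // auto-detecting decoder one of them has to be chosen.
            return "Shift_JIS";
        }
        // Otherwise fall back to the current default.
        return "Cp1252";
    }

    public static void main(String[] args) {
        System.out.println(guessDefaultEncoding("http://www.example.co.jp/index.html"));
        System.out.println(guessDefaultEncoding("http://www.apache.org/~andyc/"));
    }
}
```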

I'm also thinking of adding code to allow the tag balancer
to pass infoset augmentations along the pipeline. Specifically,
information regarding whether the event info is specified in
the document or "synthesized" by the tag balancer. This would
allow people at the end of the pipeline to tell exactly what
was really in the source document. (For performance reasons,
though, this feature would be "off" by default.)
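
To make the idea of a "synthesized" marker concrete, here is a small sketch. It is an assumption-laden illustration, not the planned implementation: the Event type, its fields, and the class name are invented, and the only repair it performs is closing elements that are still open at the end of the document.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: each event carries a flag saying whether it came from
// the source document or was added by the balancer.
final class SynthesizedEventSketch {

    static final class Event {
        final String kind;          // "start" or "end"
        final String name;
        final boolean synthesized;  // true if not present in the source

        Event(String kind, String name, boolean synthesized) {
            this.kind = kind;
            this.name = name;
            this.synthesized = synthesized;
        }
    }

    private final Deque<String> open = new ArrayDeque<>();
    private final List<Event> events = new ArrayList<>();

    void startElement(String name) {
        open.push(name);
        events.add(new Event("start", name, false));
    }

    void endElement(String name) {
        // Only a properly nested end tag is handled in this sketch.
        if (name.equals(open.peek())) {
            open.pop();
            events.add(new Event("end", name, false));
        }
    }

    // Closes whatever is still open and marks those end events as synthesized.
    List<Event> endDocument() {
        while (!open.isEmpty()) {
            events.add(new Event("end", open.pop(), true));
        }
        return events;
    }

    public static void main(String[] args) {
        SynthesizedEventSketch sketch = new SynthesizedEventSketch();
        sketch.startElement("P");
        sketch.startElement("B");
        sketch.endElement("B");
        for (Event e : sketch.endDocument()) {
            System.out.println(e.kind + " " + e.name + " synthesized=" + e.synthesized);
        }
    }
}
```

A consumer at the end of the pipeline could then skip or flag every event whose synthesized field is true.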

There'll be enough minor changes to warrant boosting the
version number to "0.4.0" instead of "0.3.4". Just a heads
up, in case anyone cares.

> > > Is it planned to include NekoHTML into the Xerces release?
> >
> [...]
> Unfortunately the link to CyberNeko is not well known to the public. I
> only noticed it by reading your recent postings on
> general@xml.apache.org. I think it should be either included in the
> Xerces distribution or made accessible from the Xerces homepage.

We are having a discussion in a separate thread regarding this 
very topic. Depending on the result of that discussion, NekoHTML
will either be rolled into Xerces OR become a separate project
of its own. In the latter case, it's not clear whether separate
projects are included in the Xerces codebase (but kept separate)
or hosted elsewhere.

I have no problem with NekoHTML remaining separate but it would
be nice to have links to related projects from the Xerces page,
as you suggest.

-- 
Andy Clark * andyc@apache.org


Re: HTML Parser Update Available

Posted by Harald Hett <h....@gis-systemhaus.de>.

> 2) A property, for example:
> 
>   "http://cyberneko.org/html/names/modify"  { "upper", "lower", "default" }
> 
> [These are just examples. I might want to modify the names.]
> 
A property would be great, but with "no" instead of "default".

> > Is it planned to include NekoHTML into the Xerces release?
> 
> There hasn't been an overwhelming demand for it. Although, a
> few people responded and said it would generally be a "good"
> thing to include with Xerces. Certainly if there is a need
> for it, I wouldn't mind rolling it into the codebase -- it's
> actually quite small so it wouldn't add a lot to the source
> or Jar file(s).
> 
> What do people think?
> 

Recently I searched the web for an HTML parser that is capable of parsing
dirty HTML of any kind. But nothing I found really convinced me.
Only JTidy seemed to fit. But none of those solutions produces a
DOM tree in the end that can easily be modified using the DOM API or
an XSLT stylesheet. That is a nice and interesting feature of CyberNeko
and makes it interesting for a lot of programmers.

Unfortunately the link to CyberNeko is not well known to the public. I
only noticed it by reading your recent postings on
general@xml.apache.org. I think it should be either included in the
Xerces distribution or made accessible from the Xerces homepage.

Bye
-- 
Harald Hett <h....@gis-systemhaus.de>
Gesellschaft für integrierte Systemplanung


Re: HTML Parser Update Available

Posted by Andy Clark <an...@apache.org>.

Harald Hett wrote:
> Why are the elements converted to upper case? Wouldn't it be nice to
> be able to change this feature between the three states (1) do nothing,
> (2) convert to uppercase, (3) convert to lowercase?

I was thinking about this, actually. It would be very easy to
add -- it's just a matter of picking the appropriate
way to handle this. Since there are three states, a feature (as
in the "setFeature" method) doesn't seem to fit. But it would
seem weird to use a property. Or maybe not...

Which would you prefer?

1) Two features, for example:

  "http://cyberneko.org/html/names/modify"     { true, false }
  "http://cyberneko.org/html/names/uppercase"  { true, false }

OR...

2) A property, for example:

  "http://cyberneko.org/html/names/modify"  { "upper", "lower", "default" }

[These are just examples. I might want to modify the names.]
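
To compare the two options, here is a small sketch of how each could map onto the three states. All names are hypothetical and nothing here is NekoHTML code. It assumes that "modify" switches case conversion on and "uppercase" selects the direction, which means two booleans give four combinations for three states and one flag is ignored whenever "modify" is false.

```java
// Hypothetical sketch comparing option 1 (two boolean features) with
// option 2 (one string-valued property).
final class NameModeSketch {

    enum Mode { UPPER, LOWER, DEFAULT }

    // Option 1: the "uppercase" flag has no effect when "modify" is false,
    // so (false, true) and (false, false) mean the same thing.
    static Mode fromFeatures(boolean modify, boolean uppercase) {
        if (!modify) {
            return Mode.DEFAULT;
        }
        return uppercase ? Mode.UPPER : Mode.LOWER;
    }

    // Option 2: one value per state, with unknown values treated as "default".
    static Mode fromProperty(String value) {
        if ("upper".equals(value)) {
            return Mode.UPPER;
        }
        if ("lower".equals(value)) {
            return Mode.LOWER;
        }
        return Mode.DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(fromFeatures(false, true));
        System.out.println(fromFeatures(true, false));
        System.out.println(fromProperty("upper"));
    }
}
```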

> Is it planned to include NekoHTML into the Xerces release?

There hasn't been an overwhelming demand for it. Although, a 
few people responded and said it would generally be a "good" 
thing to include with Xerces. Certainly if there is a need
for it, I wouldn't mind rolling it into the codebase -- it's
actually quite small so it wouldn't add a lot to the source
or Jar file(s).

What do people think?

-- 
Andy Clark * andyc@apache.org


Re: HTML Parser Update Available

Posted by Harald Hett <h....@gis-systemhaus.de>.

Hi, Andy!

Although NekoHTML is a neat tool, there are still some questions:

Why are the elements converted to upper case? Wouldn't it be nice to
be able to change this feature between the three states (1) do nothing, 
(2) convert to uppercase, (3) convert to lowercase?

Is it planned to include NekoHTML into the Xerces release?

Thanks for answering!

-- 
Harald Hett <h....@gis-systemhaus.de>
Gesellschaft für integrierte Systemplanung
