Posted to dev@abdera.apache.org by James M Snell <ja...@gmail.com> on 2006/08/08 00:01:12 UTC

Parsing HTML

I've put together a fairly simple HTML->Abdera/Axiom impl based on the
Tagsoup parser [1].  It implements the Abdera Parser interface and
creates a Document<Element> model that represents HTML as well-formed
XHTML content.  Further, it supports the ParseFilter mechanism so we
can filter out unsafe HTML content (e.g. script tags).

For example:

    Parser parser = new HtmlParser();
    ParserOptions options = parser.getDefaultParserOptions();
    options.setParseFilter(new SafeContentWhiteListParseFilter());

    String h = "foo<p style='background-color:blue'>This "
        + "<script>alert('foo');</script> <a href='this is foo'>is</a> foo "
        + "<b>bar</b> &nbsp;&raquo;&lt;foo&gt; hello";

    ByteArrayInputStream in = new ByteArrayInputStream(h.getBytes());

    Document<Element> doc = parser.parse(in, (URI)null, options);

    doc.getRoot().writeTo(System.out);

// Outputs
<xhtml:div xmlns:xhtml="http://www.w3.org/1999/xhtml">foo<xhtml:p>This
alert('foo'); <xhtml:a href="this is foo" shape="rect">is</xhtml:a> foo
<xhtml:b>bar</xhtml:b>  »&lt;foo&gt; hello</xhtml:p></xhtml:div>
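
To illustrate the whitelist idea on its own, here is a minimal sketch that
uses neither Abdera nor TagSoup.  It assumes the input is already
well-formed XML and relies only on the JDK's DOM classes.  The class name
WhitelistSketch and both whitelists are hypothetical, made up for this
example; the real SafeContentWhiteListParseFilter may well behave
differently.  Elements that are not on the list are unwrapped so their text
survives (as with the script tag above), and attributes that are not on the
list are dropped (as with style).

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class WhitelistSketch {
    // Hypothetical whitelists, chosen only for this illustration.
    static final Set<String> ELEMENTS = Set.of("div", "p", "a", "b");
    static final Set<String> ATTRIBUTES = Set.of("href");

    /** Parses well-formed markup, filters it, and serializes it again. */
    static String filter(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        // The root element itself is kept as-is apart from its attributes.
        clean(doc.getDocumentElement());

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Drops unlisted attributes and unwraps unlisted child elements. */
    static void clean(Element parent) {
        NamedNodeMap attrs = parent.getAttributes();
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            Attr a = (Attr) attrs.item(i);
            if (!ATTRIBUTES.contains(a.getName())) {
                parent.removeAttributeNode(a);
            }
        }

        Node child = parent.getFirstChild();
        while (child != null) {
            // Remember the next sibling before the tree is modified.
            Node next = child.getNextSibling();
            if (child instanceof Element) {
                Element e = (Element) child;
                clean(e); // filter the subtree first
                if (!ELEMENTS.contains(e.getTagName())) {
                    // Unwrap: hoist the (already filtered) children into
                    // the parent, then remove the empty element.
                    while (e.getFirstChild() != null) {
                        parent.insertBefore(e.getFirstChild(), e);
                    }
                    parent.removeChild(e);
                }
            }
            child = next;
        }
    }

    public static void main(String[] args) throws Exception {
        String in = "<div>foo<p style='background-color:blue'>This "
            + "<script>alert('foo');</script> "
            + "<a href='this is foo'>is</a> foo <b>bar</b></p></div>";
        System.out.println(filter(in));
    }
}
```

Unwrapping rather than deleting keeps the visible text, which matches the
output shown above, but it also means script source ends up as plain text;
a stricter filter might drop the whole subtree instead.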

There are still little bits of weirdness, but for the most part it seems
to work really well.  On the downside, I'm not sure if the Tagsoup
license is compatible with the Apache license, otherwise I'd check this
in to the extensions module.

(and oh, btw, so far this has been implemented as a single class with
only 189 lines of code, most of which are formatting :-) ....)

- James

[1] http://home.ccil.org/~cowan/XML/tagsoup/


Re: Parsing HTML

Posted by Charles Adkins <fo...@gmail.com>.
On 8/7/06, James M Snell <ja...@gmail.com> wrote:
>
> I've put together a fairly simple HTML->Abdera/Axiom impl based on the
> Tagsoup parser [1].  It implements the Abdera Parser interface and
> creates a Document<Element> model that represents HTML as well-formed
> XHTML content.  Further, it supports the ParseFilter mechanism so we can
> filter out unsafe HTML content (e.g. script tags).


sweet.
that's gonna be extremely useful.
-charles

Re: Parsing HTML

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On 8/8/06, Garrett Rooney <ro...@electricjellyfish.net> wrote:
> On 8/7/06, Garrett Rooney <ro...@electricjellyfish.net> wrote:
>
> > That's pretty slick.  I'll look into the licensing issue.
>
> Just to keep people in the loop, I talked to Cliff Schmidt (Mister ASF
> Legal) about this, and he's going to look at getting the Academic Free
> License added to the "will be final real soon" third party license
> policy.  Odds are good that it'll be considered reasonable for us to
> use, but for now it's still up in the air.

Just FYI, I finally cornered Cliff Schmidt about this, and he says we
can go ahead and use tagsoup.  There's a small possibility some people
might object to its license due to clause 9 of the AFL, but most
people seem to be ok with it, and it'll be in the final version of his
3rd party code policy document.  For now, since there's no official
policy on such things anyway we can just go ahead and use it.

-garrett

Re: Parsing HTML

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On 8/7/06, Garrett Rooney <ro...@electricjellyfish.net> wrote:

> That's pretty slick.  I'll look into the licensing issue.

Just to keep people in the loop, I talked to Cliff Schmidt (Mister ASF
Legal) about this, and he's going to look at getting the Academic Free
License added to the "will be final real soon" third party license
policy.  Odds are good that it'll be considered reasonable for us to
use, but for now it's still up in the air.

-garrett

Re: Parsing HTML

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On 8/7/06, James M Snell <ja...@gmail.com> wrote:
> I've put together a fairly simple HTML->Abdera/Axiom impl based on the
> Tagsoup parser [1].  It implements the Abdera Parser interface and
> creates a Document<Element> model that represents HTML as well-formed
> XHTML content.  Further, it supports the ParseFilter mechanism so we
> can filter out unsafe HTML content (e.g. script tags).
>
> For example:
>
>     Parser parser = new HtmlParser();
>     ParserOptions options = parser.getDefaultParserOptions();
>     options.setParseFilter(new SafeContentWhiteListParseFilter());
>
>     String h = "foo<p style='background-color:blue'>This "
>         + "<script>alert('foo');</script> <a href='this is foo'>is</a> foo "
>         + "<b>bar</b> &nbsp;&raquo;&lt;foo&gt; hello";
>
>     ByteArrayInputStream in = new ByteArrayInputStream(h.getBytes());
>
>     Document<Element> doc = parser.parse(in, (URI)null, options);
>
>     doc.getRoot().writeTo(System.out);
>
> // Outputs
> <xhtml:div xmlns:xhtml="http://www.w3.org/1999/xhtml">foo<xhtml:p>This
> alert('foo'); <xhtml:a href="this is foo" shape="rect">is</xhtml:a> foo
> <xhtml:b>bar</xhtml:b>  »&lt;foo&gt; hello</xhtml:p></xhtml:div>
>
> There are still little bits of weirdness, but for the most part it seems
> to work really well.  On the downside, I'm not sure if the Tagsoup
> license is compatible with the Apache license, otherwise I'd check this
> in to the extensions module.
>
> (and oh, btw, so far this has been implemented as a single class with
> only 189 lines of code, most of which are formatting :-) ....)

That's pretty slick.  I'll look into the licensing issue.

-garrett