Posted to j-dev@xerces.apache.org by Andy Clark <an...@apache.org> on 2002/02/25 10:49:12 UTC

NekoHTML Parser License Change

I've had a rather limited response to my NekoHTML parser based
on the Xerces Native Interface (XNI). So I'm still not sure if
it should take up space as part of the standard Xerces-J 
project. 

However... I've had enough interest that it makes sense
to make it more freely available for use. Therefore, I have
changed the license agreement to an Apache style license. You 
can download the latest release from the following URL:

  http://www.apache.org/~andyc/

If interest increases then we could consider moving it into
the Xerces-J tree or possibly hosting it on SourceForge.
In short, development will continue as warranted. There are a
few things in JTidy that look useful so that may be the next
step I take with NekoHTML.

-- 
Andy Clark * andyc@apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org

Re: NekoHTML Parser License Change

Posted by Robert Koberg <ro...@koberg.com>.

I am very interested in it too, but have not had time to work on it.

I am thinking of trying to target a website, crawl through its pages,
transform them into XML (as much as possible...) and deposit the result somewhere.

Thanks for doing this Andy!

----- Original Message -----
From: <fr...@ontosys.com>
To: <ge...@xml.apache.org>; <xe...@xml.apache.org>;
<xe...@xml.apache.org>
Cc: "Andy Clark" <an...@apache.org>
Sent: Monday, February 25, 2002 6:47 AM
Subject: Re: NekoHTML Parser License Change


> Andy's NekoHTML parser has worked well for me in a small project where
> I needed to scrape some data from a set of HTML pages.  With NekoHTML
> as the front end I was able to use an XSLT stylesheet to extract that
> data directly from the HTML pages.
>
> NekoHTML also allowed me to write a simple HTML transformation that I
> find useful when analyzing HTML page layouts:  adding a small colored
> border to each TABLE so that the table boundaries are visible.  This
> transformation requires only a few lines of XSLT added to a standard
> "identity" transformation.
>
> I expect that NekoHTML would make it easy to translate HTML code into
> XHTML format.  I have encountered a few tag-balancing glitches, where
> NekoHTML struggles to accommodate ill-formed HTML code much as the
> popular browsers do, but overall it has been very solid.
>
> NekoHTML is very easy to use.  For the most part it is a transparent
> addition to a standard Xerces/Xalan configuration, and all the usual
> APIs -- including JAXP -- seem to work as expected.
>
> Nice work Andy.  Thank you for making NekoHTML available.



---------------------------------------------------------------------
In case of troubles, e-mail:     webmaster@xml.apache.org
To unsubscribe, e-mail:          general-unsubscribe@xml.apache.org
For additional commands, e-mail: general-help@xml.apache.org

Re: NekoHTML Parser License Change

Posted by fr...@ontosys.com.

Andy's NekoHTML parser has worked well for me in a small project where
I needed to scrape some data from a set of HTML pages.  With NekoHTML
as the front end I was able to use an XSLT stylesheet to extract that
data directly from the HTML pages.
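
The scraping approach described above might look roughly like the
following. This is only a sketch, not the stylesheet used in that
project: the class name, the two-column table layout, and the
"name=value" output format are all hypothetical, and the input here is
already well-formed markup read by the JDK's default XML parser rather
than real-world HTML read by NekoHTML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class ScrapeDemo {
    // Hypothetical stylesheet: emit "first cell=second cell" for every table row.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:for-each select='//TR'>"
      + "<xsl:value-of select='TD[1]'/>"
      + "<xsl:text>=</xsl:text>"
      + "<xsl:value-of select='TD[2]'/>"
      + "<xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the stylesheet over the given markup and returns the extracted text.
    static String scrape(String wellFormedHtml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(wellFormedHtml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<HTML><BODY><TABLE>"
            + "<TR><TD>alpha</TD><TD>1</TD></TR>"
            + "<TR><TD>beta</TD><TD>2</TD></TR>"
            + "</TABLE></BODY></HTML>";
        System.out.print(scrape(page));
    }
}
```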

NekoHTML also allowed me to write a simple HTML transformation that I
find useful when analyzing HTML page layouts:  adding a small colored
border to each TABLE so that the table boundaries are visible.  This
transformation requires only a few lines of XSLT added to a standard
"identity" transformation.
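
A transformation of that kind could be sketched as below, under some
assumptions: the border is added as a CSS style attribute (the post
does not say how it was done), the class and method names are made up
for illustration, and the input is well-formed markup parsed by the
JDK's default parser instead of NekoHTML. The first template is the
usual identity copy; the second overrides it for TABLE elements only.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class TableBorderDemo {
    // Identity transformation plus one override that decorates tables.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      // Identity template: copy every attribute and node unchanged.
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // Override: copy the table, then add (or replace) a style attribute.
      + "<xsl:template match='TABLE|table'>"
      + "<xsl:copy>"
      + "<xsl:apply-templates select='@*'/>"
      + "<xsl:attribute name='style'>border: 1px solid red</xsl:attribute>"
      + "<xsl:apply-templates select='node()'/>"
      + "</xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Returns the markup with a visible border on every table.
    static String addTableBorders(String wellFormedHtml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(wellFormedHtml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(addTableBorders(
            "<HTML><BODY><TABLE width='100'><TR><TD>x</TD></TR></TABLE></BODY></HTML>"));
    }
}
```

Note that an existing style attribute on a table would be replaced by
this sketch, not merged.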

I expect that NekoHTML would make it easy to translate HTML code into
XHTML format.  I have encountered a few tag-balancing glitches, where
NekoHTML struggles to accommodate ill-formed HTML code much as the
popular browsers do, but overall it has been very solid.

NekoHTML is very easy to use.  For the most part it is a transparent
addition to a standard Xerces/Xalan configuration, and all the usual
APIs -- including JAXP -- seem to work as expected.
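
One way to picture that "transparent addition" is that a JAXP
transformation only needs an org.xml.sax.XMLReader as its input side.
The sketch below is an assumption about how such wiring could look, not
a description of NekoHTML's own API: the class and method names are
invented, and the reader used in main is the JDK's default XML parser.
An HTML-tolerant reader, such as one supplied by NekoHTML, would be
passed in at the same place.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

final class ReaderPlugDemo {
    // Stand-in reader: the JDK's default namespace-aware XML parser.
    static XMLReader defaultReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
    }

    // Pushes the markup through whichever reader is supplied and
    // serializes the resulting events with an identity transformer.
    static String copyThrough(XMLReader reader, String markup) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            SAXSource source =
                new SAXSource(reader, new InputSource(new StringReader(markup)));
            identity.transform(source, new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(copyThrough(defaultReader(), "<p>hi</p>"));
    }
}
```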

Nice work Andy.  Thank you for making NekoHTML available.


-- 
Fred Yankowski      fred@ontosys.com           tel: +1.630.879.1312
OntoSys, Inc	    PGP keyID: 7B449345        fax: +1.630.879.1370
www.ontosys.com     38W242 Deerpath Rd, Batavia, IL 60510-9461, USA


Re: NekoHTML Parser License Change

Posted by Stefano Mazzocchi <st...@apache.org>.

Sylvain Wallez wrote:
> 
> Andy Clark wrote:
> 
> >I've had a rather limited response to my NekoHTML parser based
> >on the Xerces Native Interface (XNI). So I'm still not sure if
> >it should take up space as part of the standard Xerces-J
> >project.
> >
> Having a limited response doesn't mean it's not interesting. Xerces 2 is
> rather new, and it will take some time for people to look at the new
> possibilities it offers. But you can be sure that parsing HTML is really
> a need for many people.
> 
> We've hacked AElfred to parse some "augmented" html (html with
> additional prefixed attributes), and XNI/NekoHTML looks more robust and
> open than what we've done. Now that its license allows a wider use, we
> will have a deeper look at it in the coming weeks.
> 
> I will also look for its integration in Cocoon as a
> replacement/alternative for JTidy.
> 
> Thanks for this stuff. It would also gain a wider audience if it were
> part of Xerces-J.

I was thinking the exact same thing: Cocoon is currently using JTidy in
order to parse HTML and provide XHTML that can be sent through Cocoon
pipelines.

I fully agree with Sylvain that NekoHTML would have a much higher
chance of use and growth (community-wise and adoption by other projects)
if it were integrated with Xerces-J.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------




Re: NekoHTML Parser License Change

Posted by Sylvain Wallez <sy...@anyware-tech.com>.

Andy Clark wrote:

>I've had a rather limited response to my NekoHTML parser based
>on the Xerces Native Interface (XNI). So I'm still not sure if
>it should take up space as part of the standard Xerces-J 
>project. 
>
Having a limited response doesn't mean it's not interesting. Xerces 2 is 
rather new, and it will take some time for people to look at the new 
possibilities it offers. But you can be sure that parsing HTML is really 
a need for many people.

We've hacked AElfred to parse some "augmented" html (html with 
additional prefixed attributes), and XNI/NekoHTML looks more robust and 
open than what we've done. Now that its license allows a wider use, we 
will have a deeper look at it in the coming weeks.

I will also look for its integration in Cocoon as a 
replacement/alternative for JTidy.

Thanks for this stuff. It would also gain a wider audience if it were 
part of Xerces-J.

>However... I've had enough interest that it makes sense 
>to make it more freely available for use. Therefore, I have 
>changed the license agreement to an Apache style license. You 
>can download the latest release from the following URL:
>
>  http://www.apache.org/~andyc/
>
>If interest increases then we could consider moving it into 
>the Xerces-J tree or possibly hosting it on SourceForge.
>In short, development will continue as warranted. There are a
>few things in JTidy that look useful so that may be the next
>step I take with NekoHTML.
>
-- 
Sylvain Wallez
Anyware Technologies - http://www.anyware-tech.com




