Posted to j-users@xerces.apache.org by Derek Alexander <D....@lse.ac.uk> on 2009/07/22 15:55:23 UTC
repairing document while parsing?
Hi,
Is there any way with Xerces (or any other XML parser you know of) to plug
in some kind of error handler that can attempt to repair the document being
parsed, rather than just log errors?
The specific case I have is XHTML documents that may have attribute values that
aren't escaped properly, e.g., href="http://some.server/path?blah&foo=baa"
What I want to do is catch the error that &foo is not a known entity and
replace it with &amp;foo as it ought to be, and have the parser carry on
with that.
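To make the problem concrete, here is a minimal sketch of what a standard SAX ErrorHandler sees for this input. It assumes the JDK's built-in JAXP parser (which is derived from Xerces) rather than a standalone Xerces build, and the names FatalErrorDemo, RecordingHandler and parses are hypothetical, chosen only for illustration. The handler is notified of the fatal error but has no hook for changing the input, so by default the parse is abandoned.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class FatalErrorDemo {
    /** Hypothetical handler: records fatal errors; it cannot alter the input. */
    static final class RecordingHandler implements ErrorHandler {
        final List<String> fatal = new ArrayList<>();
        @Override public void warning(SAXParseException e) { }
        @Override public void error(SAXParseException e) { }
        @Override public void fatalError(SAXParseException e) { fatal.add(e.getMessage()); }
    }

    /** Returns true if the document parsed, false if the parser gave up. */
    static boolean parses(String xml, ErrorHandler handler) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(handler);
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations end up here even though the handler did not throw.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        // The unescaped '&' from the example above.
        String xml = "<a href=\"http://some.server/path?blah&foo=baa\">link</a>";
        RecordingHandler handler = new RecordingHandler();
        System.out.println("parsed: " + parses(xml, handler));
        System.out.println("fatal errors seen: " + handler.fatal);
    }
}
```

The handler only receives a description of the problem, so any repair has to happen either before the text reaches the parser or inside a parser configuration that is built to tolerate such input.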
Cheers,
Derek
--
View this message in context: http://www.nabble.com/repairing-document-while-parsing--tp24607002p24607002.html
Sent from the Xerces - J - Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org
Re: repairing document while parsing?
Posted by Michael Glavassevich <mr...@ca.ibm.com>.
"Jacob Kjome" <ho...@visi.com> wrote on 07/22/2009 01:00:18 PM:
> That's what NekoHTML is for. Plus, it's a perfect fit for Xerces since it's
> essentially an extension of it.
It's true that it does use XNI but it's really its own thing which can be
plugged in as the XMLParserConfiguration [1] instead of one of Xerces'
built-in ones. So basically all that survives from Xerces in this
configuration are the XNI to SAX and DOM converters.
> It's actively developed as well.
>
> http://nekohtml.sourceforge.net/
> http://sourceforge.net/projects/nekohtml/
>
> Jake
Thanks.
[1] http://xerces.apache.org/xerces2-j/faq-xni.html#faq-2
Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org
Re: repairing document while parsing?
Posted by Jacob Kjome <ho...@visi.com>.
That's what NekoHTML is for. Plus, it's a perfect fit for Xerces since it's
essentially an extension of it. It's actively developed as well.
http://nekohtml.sourceforge.net/
http://sourceforge.net/projects/nekohtml/
Jake
On Wed, 22 Jul 2009 07:52:17 -0700 (PDT)
Derek Alexander <D....@lse.ac.uk> wrote:
>
> Thanks for the reply. I had looked at the JTidy project. Unfortunately their
> current stable release removes empty tags which is no good for me, and too
> many errors are reported trying to build the latest source (which includes a
> config option for not deleting empty tags, if I understand correctly). Seems
> I'll have to write something to pre-parse the docs.
>
> Regards,
> Derek
Re: repairing document while parsing?
Posted by Derek Alexander <D....@lse.ac.uk>.
Thanks for the reply. I had looked at the JTidy project. Unfortunately their
current stable release removes empty tags which is no good for me, and too
many errors are reported trying to build the latest source (which includes a
config option for not deleting empty tags, if I understand correctly). Seems
I'll have to write something to pre-parse the docs.
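One possible shape for such a pre-parse step, as a rough sketch only: escape every '&' that does not already begin a well-formed reference, then hand the result to the parser. The class AmpersandRepair, its methods, and the regular expression are assumptions made for illustration, not something proposed in this thread, and the example uses the JDK's built-in JAXP parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class AmpersandRepair {
    // An '&' that does not start a named (&name;), decimal (&#38;) or hex (&#x26;) reference.
    private static final Pattern BARE_AMP =
        Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    /** Rewrites bare ampersands as &amp; and leaves existing references untouched. */
    static String escapeBareAmpersands(String text) {
        return BARE_AMP.matcher(text).replaceAll("&amp;");
    }

    /** Repairs the text first, then parses it into a DOM. */
    static Document parseRepaired(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(escapeBareAmpersands(xml))));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed after repair", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a href=\"http://some.server/path?blah&foo=baa\">link</a>";
        Document doc = parseRepaired(xml);
        // The DOM holds the unescaped attribute value.
        System.out.println(doc.getDocumentElement().getAttribute("href"));
    }
}
```

This is a text-level fix, so it would also rewrite ampersands inside CDATA sections and comments. It also leaves references such as &nbsp; alone, and those still need a DTD that declares them. Both limits would need checking against the real documents.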
Regards,
Derek
keshlam wrote:
>
> Closest thing I can think of is the W3C's "tidy" tool, which repairs some
> of the common/obvious errors.
Re: repairing document while parsing?
Posted by ke...@us.ibm.com.
Closest thing I can think of is the W3C's "tidy" tool, which repairs some
of the common/obvious errors.
______________________________________
"... Three things see no end: A loop with exit code done wrong,
A semaphore untested, And the change that comes along. ..."
-- "Threes" Rev 1.1 - Duane Elms / Leslie Fish (
http://www.ovff.org/pegasus/songs/threes-rev-11.html)