
Posted to j-users@xerces.apache.org by Derek Alexander <D....@lse.ac.uk> on 2009/07/22 15:55:23 UTC

repairing document while parsing?

Hi,

Is there any way with Xerces (or any other XML parser you know of) to plug
in some kind of error handler that can attempt to repair the document being
parsed, rather than just log errors?

The specific case I have is XHTML documents that may have attribute values
that aren't escaped properly, e.g., href="http://some.server/path?blah&foo=baa"

What I want to do is catch the error that &foo is not a known entity and
replace it with &amp;foo as it ought to be, and have the parser carry on
with that.
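
To make the failure concrete, here is a minimal sketch. It is not
Xerces-specific: it assumes the JAXP parser bundled with the JDK, and the
class and method names are made up for illustration. It registers a SAX
ErrorHandler that records fatal errors for the href example above. The
handler is told about the problem but, as far as I can tell, has no hook for
handing corrected text back to the parser, which then aborts with an
exception.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: shows that an ErrorHandler can observe a
// well-formedness error but cannot repair the input.
class ObserveOnlyErrorHandler {

    /** Parses the XML text and returns the fatal-error messages the handler received. */
    static List<String> fatalErrorsFor(String xml) throws Exception {
        final List<String> fatals = new ArrayList<>();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler());
        reader.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
            @Override public void error(SAXParseException e) { /* ignored in this sketch */ }
            @Override public void fatalError(SAXParseException e) {
                // We can only record the problem; nothing here feeds a fix back in.
                fatals.add(e.getLineNumber() + ":" + e.getColumnNumber() + " " + e.getMessage());
            }
        });
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Expected for malformed input: the parser gives up after the fatal error.
        }
        return fatals;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<a href=\"http://some.server/path?blah&foo=baa\">link</a>";
        for (String message : fatalErrorsFor(doc)) {
            System.out.println("fatal: " + message);
        }
    }
}
```

With the JDK's parser the complaint is about the missing ';' after the
entity name rather than about an unknown entity; the exact wording will
differ between parsers.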

Cheers,
Derek


-- 
View this message in context: http://www.nabble.com/repairing-document-while-parsing--tp24607002p24607002.html
Sent from the Xerces - J - Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org

Re: repairing document while parsing?

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

"Jacob Kjome" <ho...@visi.com> wrote on 07/22/2009 01:00:18 PM:

> That's what NekoHTML is for.  Plus, it's a perfect fit for Xerces since it's
> essentially an extension of it.

It's true that it uses XNI, but it's really its own thing, which can be
plugged in as the XMLParserConfiguration [1] instead of one of Xerces'
built-in ones. So basically all that survives from Xerces in this
configuration are the XNI to SAX and DOM converters.

> It's actively developed as well.
>
> http://nekohtml.sourceforge.net/
> http://sourceforge.net/projects/nekohtml/
>
> Jake

Thanks.

[1] http://xerces.apache.org/xerces2-j/faq-xni.html#faq-2

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Re: repairing document while parsing?

Posted by Jacob Kjome <ho...@visi.com>.

That's what NekoHTML is for.  Plus, it's a perfect fit for Xerces since it's 
essentially an extension of it.  It's actively developed as well.

http://nekohtml.sourceforge.net/
http://sourceforge.net/projects/nekohtml/

Jake

On Wed, 22 Jul 2009 07:52:17 -0700 (PDT)
  Derek Alexander <D....@lse.ac.uk> wrote:
> 
> Thanks for the reply. I had looked at the JTidy project. Unfortunately their
> current stable release removes empty tags which is no good for me, and too
> many errors are reported trying to build the latest source (which includes a
> config option for not deleting empty tags, if I understand correctly). Seems
> I'll have to write something to pre-parse the docs.
> 
> Regards,
> Derek



Re: repairing document while parsing?

Posted by Derek Alexander <D....@lse.ac.uk>.

Thanks for the reply. I had looked at the JTidy project. Unfortunately their
current stable release removes empty tags which is no good for me, and too
many errors are reported trying to build the latest source (which includes a
config option for not deleting empty tags, if I understand correctly). Seems
I'll have to write something to pre-parse the docs.
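
For the unescaped-ampersand case specifically, the pre-parse step might look
something like the sketch below. It is only a sketch under assumptions: the
class and method names are hypothetical, the regular expression treats any
'&' that is not followed by something shaped like an entity or character
reference as a bare ampersand, and it parses with the JDK's bundled JAXP
parser rather than a particular Xerces build. It also rewrites '&' inside
CDATA sections and comments, where a bare '&' is legal, so real documents
may need more care.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: escapes bare ampersands before the text reaches the parser.
class AmpersandRepair {

    // Matches an '&' that does NOT start something shaped like a reference:
    // a named entity (&name;), a decimal (&#123;) or a hex (&#x1F;) character reference.
    private static final Pattern BARE_AMP =
        Pattern.compile("&(?!(?:[A-Za-z_:][A-Za-z0-9_:.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    /** Returns the text with every bare '&' replaced by "&amp;". */
    static String escapeBareAmpersands(String text) {
        return BARE_AMP.matcher(text).replaceAll("&amp;");
    }

    /** Repairs the text, then parses it into a DOM document. */
    static Document parseRepaired(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(escapeBareAmpersands(xml))));
    }

    public static void main(String[] args) throws Exception {
        String doc = "<a href=\"http://some.server/path?blah&foo=baa\">link</a>";
        Document parsed = parseRepaired(doc);
        // The parser un-escapes &amp; again, so the attribute value is the original URL.
        System.out.println(parsed.getDocumentElement().getAttribute("href"));
    }
}
```

Note that this only fixes the ampersand itself. A reference such as &nbsp;
is left alone by the pattern and will still be rejected unless a DTD that
declares it is available to the parser.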

Regards,
Derek



keshlam wrote:
> 
> Closest thing I can think of is the W3C's "tidy" tool, which repairs some 
> of the common/obvious errors.

-- 
View this message in context: http://www.nabble.com/repairing-document-while-parsing--tp24607002p24608002.html
Sent from the Xerces - J - Users mailing list archive at Nabble.com.



Re: repairing document while parsing?

Posted by ke...@us.ibm.com.

Closest thing I can think of is the W3C's "tidy" tool, which repairs some 
of the common/obvious errors.

______________________________________
"... Three things see no end: A loop with exit code done wrong,
A semaphore untested, And the change that comes along. ..."
  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish (
http://www.ovff.org/pegasus/songs/threes-rev-11.html)


