Posted to j-dev@xerces.apache.org by Richard Kelly <ra...@gmail.com> on 2009/06/07 19:16:07 UTC

Re: GSoC Update

Hi everyone,

Just a bit of an update on my GSoC work.

I've written an XNI document filter that checks for character
normalization.  I understand the process of inserting it into the
pipeline, but I am a bit unsure about the xni component manager and
getting the relevant features.  Do I need to register my component
with the component manager to receive the feature notifications or
does that automatically occur?

I've also been looking around for relevant sections of existing code
that I will need to update. This is what I've come up with so far:

- Update DOMNormalizer to normalize characters if the appropriate
feature is set.
- Update DOMParserImpl to allow DOM_NORMALIZE_CHARACTERS and
DOM_CHECK_CHAR_NORMALIZATION features
- Add character normalization flags to the normalization section of
DOMConfigurationImpl.
- Update AbstractSAXParser to allow the UNICODE_NORMALIZATION_CHECKING_FEATURE.

Let me know if you think of any glaring omissions.

The LSSerializer class also has some attributes related to character
normalization [1].  However I believe the implementation in Xerces is
actually from the Xalan project which doesn't implement them.  Should
I be looking at adding character normalization support to their
project too?


Thanks,
Richard

[1] http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-LSSerializer

---------------------------------------------------------------------
To unsubscribe, e-mail: j-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-dev-help@xerces.apache.org

Re: GSoC Update

Posted by Nathan Beyer <nb...@gmail.com>.

On Aug 1, 2009, at 8:05 AM, Richard Kelly <ra...@gmail.com> wrote:

> Hi again,
>
> First of all, apologies for the late update, my new semester has
> started here and it's been quite busy.
> However I've still been working away on my code and I think the
> functionality is mostly complete at
> this stage,  but I still have some bugs to fix and a lot of testing  
> to do.
>
> I have a few more questions:
>
> I've added some code to normalize() functions in various Node
> subclasses.  Do I need to generate a new
> serialization version for these classes (I didn't change any data
> members)?  If so, how do I do that?

You probably don't need to generate a new one if none of the instance  
variables changed, but I'd have to see the code to know for sure.

However it's probably desirable to NOT change the ID and implement  
custom serialization to make any changes passive.

-Nathan
>
> Another question: I want to find out the name of the element
> currently being processed.  Is there
> a generic way to get this information from within the pipeline or do I
> need to keep track of the current
> element within my component?
>
>
> Thanks,
> Richard

Re: GSoC Update

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Richard,

Richard Kelly <ra...@gmail.com> wrote on 08/10/2009 10:14:36 AM:

> Hi everyone,
>
> I've finished all the functionality for my component now and hope to
> put up the updated patches in a day or so.  I'm just double-checking
> for bugs and tidying up code now.
>
> Summary of the updated changes:
> - The Document.normalizeDocument() [1] will now perform character
> normalization if the "normalize-characters" flag is set in the DOM
> Configuration.
> - Node.normalize() [2] functions will perform character normalization
> of any text nodes in the subtree if the "normalize-characters" flag is
> set in the DOM Configuration.
> - Unicode characters that are split across multiple 'characters'
> events in the pipeline will now be handled correctly.
> - Changes to use SymbolTables, better error messages and other fixes
> as suggested by feedback.
>
> This week I'll be updating the documentation and finishing off the
> test sets.

Sounds good. I think that covers everything required for these features.
I'm curious though... Will this include the ICU (normalization module) jar
and instructions on how to build it if we need to again? Wanted to try out
your first patch (rather than just reading it) but couldn't because you
didn't include the ICU jar the first time.

> Any feedback on the code is most welcome.
> thanks,
> Richard
>
> [1] http://www.w3.org/TR/DOM-Level-3-Core/core.html#Document3-normalizeDocument
> [2] http://www.w3.org/TR/DOM-Level-3-Core/core.html#ID-normalize

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Re: GSoC Update

Posted by Richard Kelly <ra...@gmail.com>.

Hi everyone,

I've finished all the functionality for my component now and hope to
put up the updated patches in a day or so.  I'm just double-checking
for bugs and tidying up code now.

Summary of the updated changes:
- The Document.normalizeDocument() [1] will now perform character
normalization if the "normalize-characters" flag is set in the DOM
Configuration.
- Node.normalize() [2] functions will perform character normalization
of any text nodes in the subtree if the "normalize-characters" flag is
set in the DOM Configuration.
- Unicode characters that are split across multiple 'characters'
events in the pipeline will now be handled correctly.
- Changes to use SymbolTables, better error messages and other fixes
as suggested by feedback.
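
For anyone who wants to try the "normalize-characters" parameter from
application code, here is a minimal sketch using only the standard
org.w3c.dom API. The class and method names are made up for
illustration. It asks the DOMConfiguration whether the parameter can be
set before setting it, because an implementation without character
normalization support is allowed to refuse the value true.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of Xerces.
class NormalizeCharactersDemo {

    /**
     * Enables "normalize-characters" when the implementation supports it,
     * then normalizes the document. Returns true if the parameter was set.
     */
    static boolean normalizeWithCharacters(Document doc) {
        DOMConfiguration config = doc.getDomConfig();
        boolean supported =
                config.canSetParameter("normalize-characters", Boolean.TRUE);
        if (supported) {
            config.setParameter("normalize-characters", Boolean.TRUE);
        }
        doc.normalizeDocument();
        return supported;
    }

    /** Builds a tiny document whose text is 'e' followed by a combining acute accent. */
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElementNS(null, "root"))
               .appendChild(doc.createTextNode("e\u0301"));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Whether the text node comes back composed depends entirely on the DOM
implementation in use, so the return value is the thing to check.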

This week I'll be updating the documentation and finishing off the test sets.
Any feedback on the code is most welcome.
thanks,
Richard

[1] http://www.w3.org/TR/DOM-Level-3-Core/core.html#Document3-normalizeDocument
[2] http://www.w3.org/TR/DOM-Level-3-Core/core.html#ID-normalize

Re: GSoC Update

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Richard,

Richard Kelly <ra...@gmail.com> wrote on 08/01/2009 09:05:34 AM:

> Hi again,
>
> First of all, apologies for the late update, my new semester has
> started here and it's been quite busy.

No worries. University was still recent enough that I can still remember
how crazy busy it can get.

> However I've still been working away on my code and I think the
> functionality is mostly complete at
> this stage,  but I still have some bugs to fix and a lot of testing
> to do.
>
> I have a few more questions:
>
> I've added some code to normalize() functions in various Node
> subclasses.  Do I need to generate a new
> serialization version for these classes (I didn't change any data
> members)?  If so, how do I do that?

I assume this is for normalizing the characters in text nodes if the
"normalize-characters" parameter is set to true. I expect that would only
mutate the values of existing fields. We've been trying to keep DOM
serialization compatible with earlier releases. Changing the values of
serialVersionUIDs would bust that. Some users actually depend on this so
need to be careful with the changes we make, but since you said "I didn't
change any data members" it sounds like your changes are safe.
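
To illustrate the point about serialization compatibility, here is a
small self-contained sketch. TextHolder is a made-up stand-in for a DOM
text node class, and the serialVersionUID value is arbitrary. The
normalize() method only changes the value of an existing field, so the
serialized form keeps the same shape and the pinned ID does not need to
change.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.text.Normalizer;

// Hypothetical stand-in for a serializable node class.
class TextHolder implements Serializable {
    // Pinned so that adding methods later does not alter the stream ID.
    private static final long serialVersionUID = 1L;

    private String data;

    TextHolder(String data) { this.data = data; }

    String getData() { return data; }

    // New behaviour: mutates the value of an existing field, adds no fields.
    void normalize() {
        data = Normalizer.normalize(data, Normalizer.Form.NFC);
    }
}

class SerialDemo {
    /** Serializes the object to memory and reads it back. */
    static TextHolder roundTrip(TextHolder in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(in);
            }
            try (ObjectInputStream oin = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TextHolder) oin.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Adding, removing or retyping instance fields is the kind of change that
would need more care, which is why it matters that no data members were
touched.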

> Another question: I want to find out the name of the element
> currently being processed.  Is there
> a generic way to get this information from within the pipeline or do I
> need to keep track of the current
> element within my component?

Aside from the startElement(), emptyElement() and endElement() methods on
XMLDocumentHandler there isn't anything else. If you need to keep track of
where you are in between these events, pushing (in startElement()) and
popping (in endElement()) the names on a stack works well.
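
A minimal sketch of that push/pop bookkeeping, kept free of Xerces
types so it runs on its own. ElementTracker and its method names are
hypothetical. In a real filter the three methods would be called from
the matching XMLDocumentHandler callbacks, passing the element's raw
name as a String rather than holding on to the QName object, on the
assumption that the QName instance may be reused by the caller.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper that remembers which elements are currently open.
class ElementTracker {
    private final Deque<String> open = new ArrayDeque<>();

    /** Call from startElement(): the element becomes the current one. */
    void startElement(String rawName) {
        open.push(rawName);
    }

    /** Call from endElement(): the parent becomes current again. */
    void endElement() {
        open.pop();
    }

    /** Call from emptyElement(): it opens and closes at once, so nothing is kept. */
    void emptyElement(String rawName) {
        // intentionally empty
    }

    /** Name of the innermost open element, or null outside the root element. */
    String current() {
        return open.peek();
    }
}
```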

> Thanks,
> Richard

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Re: GSoC Update

Posted by Richard Kelly <ra...@gmail.com>.

Hi again,

First of all, apologies for the late update, my new semester has
started here and it's been quite busy.
However I've still been working away on my code and I think the
functionality is mostly complete at
this stage,  but I still have some bugs to fix and a lot of testing to do.

I have a few more questions:

I've added some code to normalize() functions in various Node
subclasses.  Do I need to generate a new
serialization version for these classes (I didn't change any data
members)?  If so, how do I do that?

Another question: I want to find out the name of the element
currently being processed.  Is there
a generic way to get this information from within the pipeline or do I
need to keep track of the current
element within my component?


Thanks,
Richard

Re: GSoC Update

Posted by Richard Kelly <ra...@gmail.com>.

Hi everyone,

Just a quick update on the status of my work.  The past week I've been
working on extending the normalize functions in DOM
to support character normalization.  I've also been implementing
various changes based on the feedback from my initial code (thanks
Michael!).

Richard


2009/7/13 Michael Glavassevich <mr...@ca.ibm.com>:
> Hi Richard,
>
> Richard Kelly <ra...@gmail.com> wrote on 07/12/2009 05:59:36 AM:
>
>> Hi everyone,
>>
>> I've made some progress on my character normalization, and I
>> would like to get some feedback on my work to ensure I'm on the
>> right path.
>
> I've had an opportunity to review your code. What you have so far is looking
> really good. Great work!
>
>> I've uploaded the current state of my patches on JIRA [1].
>
> I do have some suggestions for improvements which I'll attach to the JIRA
> issue.
>
>> CharacterNormalizer.java is the new component that does the actual work.
>> CharacterNormalizer.patch is all the changes to existing files that I
>> needed to make.
>>
>> The relevant SAX [2] and DOM [3][4] character normalization features
>> do appear to be working as intended with these changes (except for the
>> tasks mentioned below).  I've implemented it as an XNI component as we
>> discussed and use two Xerces features to control this component and
>> determine whether or not it gets added to the pipeline.
>>
>> Still on my to do list:
>> - DOM Level 3 normalizeDocument() and Node.normalize() functions:
>> These functions don't use the pipeline so I am planning to add code to
>> directly call the component from within these functions.
>> - Multiple character data stream events are not handled correctly:
>> Since Unicode characters can be larger than 16 bits they may get split
>> up across multiple calls to 'characters' events.  If this happens the
>> character may not be normalized correctly.  In order to avoid this, I
>> plan to use a buffer within my component to keep track of characters
>> that overlap these events.
>> - A comprehensive set of tests to check that the features work as
>> described in the standards.  I've done basic testing for a number of
>> cases (which it passed successfully) but obviously we would want
>> something more comprehensive and also do some performance testing.
>>
>> If anyone would like to take a look and see if there are any obvious
>> problems, that would be great.
>>
>> thanks,
>> Richard
>>
>> [1] https://issues.apache.org/jira/browse/XERCESJ-1383
>> [2] http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html#package_description
>> [3] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-check-character-normalization
>> [4] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-normalize-characters
>
> Thanks.
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org

Re: GSoC Update

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Richard,

Richard Kelly <ra...@gmail.com> wrote on 07/12/2009 05:59:36 AM:

> Hi everyone,
>
> I've made some progress on my character normalization, and I
> would like to get some feedback on my work to ensure I'm on the
> right path.

I've had an opportunity to review your code. What you have so far is
looking really good. Great work!

> I've uploaded the current state of my patches on JIRA [1].

I do have some suggestions for improvements which I'll attach to the JIRA
issue.

> CharacterNormalizer.java is the new component that does the actual work.
> CharacterNormalizer.patch is all the changes to existing files that I
> needed to make.
>
> The relevant SAX [2] and DOM [3][4] character normalization features
> do appear to be working as intended with these changes (except for the
> tasks mentioned below).  I've implemented it as an XNI component as we
> discussed and use two Xerces features to control this component and
> determine whether or not it gets added to the pipeline.
>
> Still on my to do list:
> - DOM Level 3 normalizeDocument() and Node.normalize() functions:
> These functions don't use the pipeline so I am planning to add code to
> directly call the component from within these functions.
> - Multiple character data stream events are not handled correctly:
> Since Unicode characters can be larger than 16 bits they may get split
> up across multiple calls to 'characters' events.  If this happens the
> character may not be normalized correctly.  In order to avoid this, I
> plan to use a buffer within my component to keep track of characters
> that overlap these events.
> - A comprehensive set of tests to check that the features work as
> described in the standards.  I've done basic testing for a number of
> cases (which it passed successfully) but obviously we would want
> something more comprehensive and also do some performance testing.
>
> If anyone would like to take a look and see if there are any obvious
> problems, that would be great.
>
> thanks,
> Richard
>
> [1] https://issues.apache.org/jira/browse/XERCESJ-1383
> [2] http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html#package_description
> [3] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-check-character-normalization
> [4] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-normalize-characters

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Re: GSoC Update

Posted by Richard Kelly <ra...@gmail.com>.

Hi everyone,

I've made some progress on my character normalization, and I
would like to get some feedback on my work to ensure I'm on the
right path.

I've uploaded the current state of my patches on JIRA [1].

CharacterNormalizer.java is the new component that does the actual work.
CharacterNormalizer.patch is all the changes to existing files that I
needed to make.

The relevant SAX [2] and DOM [3][4] character normalization features
do appear to be working as intended with these changes (except for the
tasks mentioned below). I've implemented it as an XNI component as we
discussed and use two Xerces features to control this component and
determine whether or not it gets added to the pipeline.

Still on my to do list:
- DOM Level 3 normalizeDocument() and Node.normalize() functions:
These functions don't use the pipeline so I am planning to add code to
directly call the component from within these functions.
- Multiple character data stream events are not handled correctly:
Since Unicode characters can be larger than 16 bits they may get split
up across multiple calls to 'characters' events. If this happens the
character may not be normalized correctly. In order to avoid this, I
plan to use a buffer within my component to keep track of characters
that overlap these events.
- A comprehensive set of tests to check that the features work as
described in the standards. I've done basic testing for a number of
cases (which it passed successfully) but obviously we would want
something more comprehensive and also do some performance testing.
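
The buffering idea in the second to-do item could be sketched roughly
as follows. This is an illustration only, not the code from the patch;
SurrogateBuffer and its methods are made-up names. It holds back a
trailing high surrogate from one 'characters' chunk and prepends it to
the next chunk, so a supplementary character is never processed in two
halves. A real normalizer would probably also need to hold back text
when the next chunk might start with a combining character, which this
sketch does not attempt.

```java
// Hypothetical helper: keeps a surrogate pair together across chunk boundaries.
class SurrogateBuffer {
    private char pendingHigh;
    private boolean hasPending;

    /**
     * Accepts one chunk of character data and returns the text that is safe
     * to process now. A high surrogate at the very end is kept for later.
     */
    String accept(char[] ch, int offset, int length) {
        StringBuilder sb = new StringBuilder(length + 1);
        if (hasPending) {
            sb.append(pendingHigh);
            hasPending = false;
        }
        sb.append(ch, offset, length);

        int last = sb.length() - 1;
        if (last >= 0 && Character.isHighSurrogate(sb.charAt(last))) {
            pendingHigh = sb.charAt(last);
            hasPending = true;
            sb.setLength(last);
        }
        return sb.toString();
    }

    /** Call at a hard boundary (e.g. end of element); returns anything still held back. */
    String flush() {
        if (!hasPending) {
            return "";
        }
        hasPending = false;
        return String.valueOf(pendingHigh);
    }
}
```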

If anyone would like to take a look and see if there are any obvious
problems, that would be great.

thanks,
Richard

[1] https://issues.apache.org/jira/browse/XERCESJ-1383
[2] http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html#package_description
[3] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-check-character-normalization
[4] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-normalize-characters

Re: GSoC Update

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Richard,

Richard Kelly <ra...@gmail.com> wrote on 06/30/2009 08:37:23 AM:

> Hi everyone,
>
> Just a quick question about the XMLAttributes class.
>
> The other structures in the pipeline (QName, XMLString) are read-only
> and I copy the contents elsewhere if I want to modify them;
> e.g.
>
> private final QName fQName = new QName();
> ...
> protected QName normalize(QName qname) {
>  String prefix = qname.prefix != null ? normalize(qname.prefix) : null;
>  // etc
>  fQName.setValues(prefix, localpart, rawname, uri);
>  return fQName;
> }
>
> XMLAttributes, however, are read-write [1], so does this mean I should
> modify the existing XMLAttributes directly or do I still need to
> copy the contents to another structure?

I think you can modify them in place. The components earlier in the
pipeline should already be done with this object and the ones after the
normalizer would want to be consuming the normalized XMLAttributes. There's
a special implementation of XMLAttributes in
org.apache.xerces.dom.DOMNormalizer (called XMLAttributesProxy) which may
not currently be expecting that, though I'm sure it could be updated so
that it's fully read-write if necessary.

> thanks,
> Richard
>
> [1] http://xerces.apache.org/xerces2-j/javadocs/xni/org/apache/xerces/xni/XMLAttributes.html

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Re: GSoC Update

Posted by Richard Kelly <ra...@gmail.com>.

Hi everyone,

Just a quick question about the XMLAttributes class.

The other structures in the pipeline (QName, XMLString) are read-only
and I copy the contents elsewhere if I want to modify them;
e.g.

private final QName fQName = new QName();
...
protected QName normalize(QName qname) {
 String prefix = qname.prefix != null ? normalize(qname.prefix) : null;
 // etc
 fQName.setValues(prefix, localpart, rawname, uri);
 return fQName;
}


XMLAttributes, however, are read-write [1], so does this mean I should
modify the existing XMLAttributes directly or do I still need to
copy the contents to another structure?

thanks,
Richard

[1] http://xerces.apache.org/xerces2-j/javadocs/xni/org/apache/xerces/xni/XMLAttributes.html

Re: GSoC Update

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Richard,

Richard Kelly <ra...@gmail.com> wrote on 06/23/2009 12:00:18 PM:

> Hi all,
>
> I'm finally finishing my exams this week, so I'll be able to dedicate
> more time to this project.  I thought I'd give an update of where I'm
> at.
> So far, I've done this:
> - Created a character normalization component that performs unicode
> normalization.
> - Modified XML11Configuration to handle the new features and to add
> and remove the component from the pipeline when appropriate.
> - Modified AbstractSAXParser to handle the SAX character normalization
> flags.
> - Created basic test files to ensure the features are working as
> expected.
> - Extended the character normalization component to deal with
> composing characters.
> - Updated the XML messages for character normalization errors
> - Built the ICU4J component and updated build.xml to use it.

This sounds really good. Looking forward to seeing your first patch.

> At the moment, I'm trying to map the 'relevant constructs' [1] in the
> XML specification to relevant Document Handler events.  These
> constructs consist of:
>    1.  The replacement text of all parsed entities
>    2.  All text matching, in context, one of the following
> productions:  CData, CharData, content, Name, Nmtoken.
>
> After looking through the XML specification and correlating the above
> with DocumentHandler functions [2], I've interpreted this to mean:
> - normalize the text of 'characters' events (since this event matches
> replacement text, CData, CharData and content productions)
> - normalize QNames and XMLAttributes in any events where they occur
> (this matches most Name and Nmtoken productions)
> - normalize name parameters in doctypeDecl, startGeneralEntity,
> processingInstruction, and endGeneralEntity events (additional
> structures in which Name productions occur)

Possibly more than that. I think normalization applies to all content in
the document (including comments) with an additional requirement "that none
of the relevant constructs listed above begins (after character references
are expanded) with a composing character as defined by B Definitions for
Character Normalization".

> If anyone can think of other events in which these productions are
> used, I would be most grateful if you could point them out.
>
> Thanks for all your assistance so far, it has been a great help.
> regards,
> Richard
>
>
> [1] http://www.w3.org/TR/xml11/#sec-normalization-checking
> [2] http://xerces.apache.org/xerces2-j/javadocs/xni/org/apache/xerces/xni/XMLDocumentHandler.html

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Re: GSoC Update

Posted by Richard Kelly <ra...@gmail.com>.

Hi all,

I'm finally finishing my exams this week, so I'll be able to dedicate
more time to this project.  I thought I'd give an update of where I'm
at.
So far, I've done this:
- Created a character normalization component that performs unicode
normalization.
- Modified XML11Configuration to handle the new features and to add
and remove the component from the pipeline when appropriate.
- Modified AbstractSAXParser to handle the SAX character normalization flags.
- Created basic test files to ensure the features are working as expected.
- Extended the character normalization component to deal with
composing characters.
- Updated the XML messages for character normalization errors
- Built the ICU4J component and updated build.xml to use it.

At the moment, I'm trying to map the 'relevant constructs' [1] in the
XML specification to relevant Document Handler events.  These
constructs consist of:
   1.  The replacement text of all parsed entities
   2.  All text matching, in context, one of the following
productions:  CData, CharData, content, Name, Nmtoken.

After looking through the XML specification and correlating the above
with DocumentHandler functions [2], I've interpreted this to mean:
- normalize the text of 'characters' events (since this event matches
replacement text, CData, CharData and content productions)
- normalize QNames and XMLAttributes in any events where they occur
(this matches most Name and Nmtoken productions)
- normalize name parameters in doctypeDecl, startGeneralEntity,
processingInstruction, and endGeneralEntity events (additional
structures in which Name productions occur)
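
As a rough illustration of the per-event check, the sketch below tests
a piece of text for Normalization Form C and composes it when needed.
It uses java.text.Normalizer from the JDK as a stand-in so that it runs
without extra jars; the component discussed in this thread is built on
ICU4J instead. NfcCheck and its method names are hypothetical, and the
extra XML 1.1 rule about constructs starting with a composing character
is not covered here.

```java
import java.text.Normalizer;

// Hypothetical helper; java.text.Normalizer stands in for ICU4J.
class NfcCheck {

    /** True when the text is already in Normalization Form C. */
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    /** Returns the text converted to Normalization Form C. */
    static String toNfc(CharSequence text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }
}
```

The same two calls would apply whether the text comes from a
'characters' event, an attribute value or a name.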

If anyone can think of other events in which these productions are
used, I would be most grateful if you could point them out.

Thanks for all your assistance so far, it has been a great help.
regards,
Richard


[1] http://www.w3.org/TR/xml11/#sec-normalization-checking
[2] http://xerces.apache.org/xerces2-j/javadocs/xni/org/apache/xerces/xni/XMLDocumentHandler.html


2009/6/16 Michael Glavassevich <mr...@ca.ibm.com>:
> Hi Richard,
>
> The component you're looking for is the XMLErrorReporter [1]. It will take
> care of looking up and formatting the error messages (that you've added to
> the message file, e.g. XMLMessages.properties), creating the exception,
> supplying it with the right locator information and reporting the error to
> the user's error handler. You can obtain an instance of it from the
> XMLComponentManager by querying the
> "http://apache.org/xml/properties/internal/error-reporter" property. You'll
> find plenty of examples of its usage around the Xerces source (in particular
> other classes in org.apache.xerces.impl).
>
> Thanks.
>
> [1] http://xerces.apache.org/xerces2-j/javadocs/xerces2/org/apache/xerces/impl/XMLErrorReporter.html
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org
>

Re: GSoC Update

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Richard,

The component you're looking for is the XMLErrorReporter [1]. It will take
care of looking up and formatting the error messages (that you've added to
the message file, e.g. XMLMessages.properties), creating the exception,
supplying it with the right locator information and reporting the error to
the user's error handler. You can obtain an instance of it from the
XMLComponentManager by querying the
"http://apache.org/xml/properties/internal/error-reporter" property. You'll
find plenty of examples of its usage around the Xerces source (in
particular other classes in org.apache.xerces.impl).

Thanks.

[1] http://xerces.apache.org/xerces2-j/javadocs/xerces2/org/apache/xerces/impl/XMLErrorReporter.html

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Richard Kelly <ra...@gmail.com> wrote on 06/15/2009 11:24:13 AM:

> Hi,
>
> I managed to get my code working with the SAX parser this week, so at
> the moment if you do something like:
>
>   saxParser.setFeature(
>       "http://xml.org/sax/features/unicode-normalization-checking", true);
>
> and give the parser an xml file with unicode that is not normalized,
> it will do this:
>
> error: Parse error occurred - check-character-normalization-failure
> org.xml.sax.SAXException: check-character-normalization-failure
>         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>         at sax.Counter.main(Unknown Source)
>
> Obviously my next step is for more appropriate error handling.  At the
> moment my component is just throwing an XNI exception.  I think I
> should be converting this to a SAX exception and using a locator to
> identify the appropriate lines in the xml file.  Is that right?  The
> other class that I looked at was XMLParseException, but it doesn't
> seem to have a way of specifying severity.
>
> Thanks,
> Richard
>
>
>
> 2009/6/10 Michael Glavassevich <mr...@ca.ibm.com>:
> > Hi Richard,
> >
> > Richard Kelly <ra...@gmail.com> wrote on 06/07/2009 01:16:07 PM:
> >
> >> Hi everyone,
> >>
> >> Just a bit of an update on my GSoC work.
> >>
> >> I've written an XNI document filter that checks for character
> >> normalization.  I understand the process of inserting it into the
> >> pipeline, but I am a bit unsure about the xni component manager and
> >> getting the relevant features.  Do I need to register my component
> >> with the component manager to receive the feature notifications or
> >> does that automatically occur?
> >
> > Provided that you've registered your component with the component manager
> > (e.g. calling addCommonComponent() in XML11Configuration) it will receive
> > notifications when features and properties are changed and will also get a
> > chance to read the configuration on reset. For performance reasons it's best
> > to defer registering the component (and also creating it) until the
> > character normalization feature is turned on.
> >
> >> I've also been looking around for relevant sections of existing code
> >> that I will need to update. This is what I've come up with so far:
> >>
> >> - Update DOMNormalizer to normalize characters if the appropriate
> >> feature is set.
> >> - Update DOMParserImpl to allow DOM_NORMALIZE_CHARACTERS and
> >> DOM_CHECK_CHAR_NORMALIZATION features
> >> - Add character normalization flags to the normalization section of
> >> DOMConfigurationImpl.
> >> - Update AbstractSAXParser to allow the
> >> UNICODE_NORMALIZATION_CHECKING_FEATURE.
> >
> > That sounds about right.
> >
> >> Let me know if you think of any glaring omissions.
> >>
> >> The LSSerializer class also has some attributes related to character
> >> normalization [1].  However I believe the implementation in Xerces is
> >> actually from the Xalan project which doesn't implement them.  Should
> >> I be looking at adding character normalization support to their
> >> project too?
> >
> > That's not something I'd considered when I proposed the project. I think it
> > would be a good addition though probably something to leave until the end if
> > you still have time (assuming you're interested in working on it).
> >
> >> Thanks,
> >> Richard
> >>
> >> [1] http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-LSSerializer
> >
> > Thanks.
> >
> > Michael Glavassevich
> > XML Parser Development
> > IBM Toronto Lab
> > E-mail: mrglavas@ca.ibm.com
> > E-mail: mrglavas@apache.org

Re: GSoC Update

Posted by Richard Kelly <ra...@gmail.com>.

Hi,

I managed to get my code working with the SAX parser this week, so at
the moment if you do something like:

  saxParser.setFeature("http://xml.org/sax/features/unicode-normalization-checking",
true);

and give the parser an xml file with unicode that is not normalized,
it will do this:

error: Parse error occurred - check-character-normalization-failure
org.xml.sax.SAXException: check-character-normalization-failure
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at sax.Counter.main(Unknown Source)
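
For anyone who wants to try the feature against whatever parser is on their
classpath, here is a minimal sketch using only the standard JAXP/SAX API. The
class and method names are made up for illustration. A parser without this
support should reject the request with SAXNotRecognizedException or
SAXNotSupportedException, so the sketch reports that instead of assuming the
feature can be enabled.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper class, not part of Xerces.
class NormalizationFeatureProbe {
    // Standard SAX2 feature identifier quoted in the message above.
    static final String FEATURE =
        "http://xml.org/sax/features/unicode-normalization-checking";

    /** Tries to turn the feature on and returns a short status string. */
    static String tryEnable(XMLReader reader) {
        try {
            reader.setFeature(FEATURE, true);
            return "enabled";
        } catch (SAXNotRecognizedException e) {
            return "not recognized";   // the parser does not know the feature
        } catch (SAXNotSupportedException e) {
            return "not supported";    // known feature, but 'true' is rejected
        }
    }

    /** Probes the default JAXP SAX parser found on the classpath. */
    static String probeDefaultParser() {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return tryEnable(reader);
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("could not create a SAX parser", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(FEATURE + ": " + probeDefaultParser());
    }
}
```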

Obviously my next step is more appropriate error handling.  At the
moment my component just throws an XNI exception.  I think I
should be converting this to a SAX exception and using a locator to
identify the appropriate lines in the XML file.  Is that right?  The
other class that I looked at was XMLParseException, but it doesn't
seem to have a way of specifying severity.
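
To make the idea concrete, here is a rough sketch of that approach using only
standard org.xml.sax classes. It is not the actual Xerces code, and the class
and helper names are invented for illustration. In plain SAX the severity is
not a field on the exception. It is expressed by which ErrorHandler callback
receives it (warning, error or fatalError), so the sketch routes the failure
to error() as a recoverable error.

```java
import org.xml.sax.ErrorHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.LocatorImpl;

// Hypothetical helper class, not part of Xerces.
class NormalizationErrorReporting {
    /** Builds an exception that carries the locator's position. */
    static SAXParseException failureAt(Locator locator) {
        return new SAXParseException(
            "check-character-normalization-failure", locator);
    }

    /** Reports the failure as a recoverable error (severity = error). */
    static void report(ErrorHandler handler, Locator locator)
            throws SAXException {
        handler.error(failureAt(locator));
    }

    /** Formats an exception as systemId:line:column: message. */
    static String describe(SAXParseException e) {
        return e.getSystemId() + ":" + e.getLineNumber() + ":"
            + e.getColumnNumber() + ": " + e.getMessage();
    }

    public static void main(String[] args) throws SAXException {
        // Made-up position standing in for the parser's current location.
        LocatorImpl locator = new LocatorImpl();
        locator.setSystemId("example.xml");
        locator.setLineNumber(12);
        locator.setColumnNumber(7);

        ErrorHandler handler = new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                System.out.println(describe(e));
            }
        };
        // prints "example.xml:12:7: check-character-normalization-failure"
        report(handler, locator);
    }
}
```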

Thanks,
Richard



2009/6/10 Michael Glavassevich <mr...@ca.ibm.com>:
> Hi Richard,
>
> Richard Kelly <ra...@gmail.com> wrote on 06/07/2009 01:16:07 PM:
>
>> Hi everyone,
>>
>> Just a bit of an update on my GSoC work.
>>
>> I've written an XNI document filter that checks for character
>> normalization.  I understand the process of inserting it into the
>> pipeline, but I am a bit unsure about the XNI component manager and
>> getting the relevant features.  Do I need to register my component
>> with the component manager to receive the feature notifications or
>> does that automatically occur?
>
> Provided that you've registered your component with the component manager
> (e.g. calling addCommonComponent() in XML11Configuration) it will receive
> notifications when features and properties are changed and will also get a
> chance to read the configuration on reset. For performance reasons it's best
> to defer registering the component (and also creating it) until the
> character normalization feature is turned on.
>
>> I've also been looking around for relevant sections of existing code
>> that I will need to update. This is what I've come up with so far:
>>
>> - Update DOMNormalizer to normalize characters if the appropriate
>> feature is set.
>> - Update DOMParserImpl to allow DOM_NORMALIZE_CHARACTERS and
>> DOM_CHECK_CHAR_NORMALIZATION features
>> - Add character normalization flags to the normalization section of
>> DOMConfigurationImpl.
>> - Update AbstractSAXParser to allow the
>> UNICODE_NORMALIZATION_CHECKING_FEATURE.
>
> That sounds about right.
>
>> Let me know if you think of any glaring omissions.
>>
>> The LSSerializer class also has some attributes related to character
>> normalization [1].  However I believe the implementation in Xerces is
>> actually from the Xalan project which doesn't implement them.  Should
>> I be looking at adding character normalization support to their
>> project too?
>
> That's not something I'd considered when I proposed the project. I think it
> would be a good addition though probably something to leave until the end if
> you still have time (assuming you're interested in working on it).
>
>> Thanks,
>> Richard
>>
>> [1] http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-LSSerializer
>
> Thanks.
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org

Re: GSoC Update

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Richard,

Richard Kelly <ra...@gmail.com> wrote on 06/07/2009 01:16:07 PM:

> Hi everyone,
>
> Just a bit of an update on my GSoC work.
>
> I've written an XNI document filter that checks for character
> normalization.  I understand the process of inserting it into the
> pipeline, but I am a bit unsure about the XNI component manager and
> getting the relevant features.  Do I need to register my component
> with the component manager to receive the feature notifications or
> does that automatically occur?

Provided that you've registered your component with the component manager
(e.g. calling addCommonComponent() in XML11Configuration) it will receive
notifications when features and properties are changed and will also get a
chance to read the configuration on reset. For performance reasons it's
best to defer registering the component (and also creating it) until the
character normalization feature is turned on.
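
The deferred registration pattern can be sketched roughly as follows. To keep
the example self-contained it uses made-up stand-in types (Component,
Configuration, NormalizationChecker) rather than the real XNI interfaces or
XML11Configuration, so treat it as an illustration of the idea and not as
Xerces code: the checker is neither created nor registered until the feature
is first switched on, and from then on it receives feature notifications like
any other component.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: these types are stand-ins, not the XNI API.
class LazyComponentSketch {
    /** Stand-in for a pipeline component that is told about feature changes. */
    interface Component {
        void featureChanged(String featureId, boolean state);
    }

    /** Stand-in for the normalization checking filter. */
    static class NormalizationChecker implements Component {
        boolean enabled;

        @Override
        public void featureChanged(String featureId, boolean state) {
            if (Configuration.NORMALIZATION_FEATURE.equals(featureId)) {
                enabled = state;
            }
        }
    }

    /** Stand-in for a parser configuration acting as component manager. */
    static class Configuration {
        static final String NORMALIZATION_FEATURE =
            "http://xml.org/sax/features/unicode-normalization-checking";

        private final List<Component> components = new ArrayList<>();
        private NormalizationChecker checker; // created on demand

        void addComponent(Component c) {
            components.add(c);
        }

        void setFeature(String featureId, boolean state) {
            // Create and register the checker only when it is first needed.
            if (state && checker == null
                    && NORMALIZATION_FEATURE.equals(featureId)) {
                checker = new NormalizationChecker();
                addComponent(checker);
            }
            // Registered components are notified of every feature change.
            for (Component c : components) {
                c.featureChanged(featureId, state);
            }
        }

        int componentCount() {
            return components.size();
        }

        boolean checkerEnabled() {
            return checker != null && checker.enabled;
        }
    }
}
```

Parsers that never turn the feature on pay nothing for it in this arrangement,
which is the performance point made above.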

> I've also been looking around for relevant sections of existing code
> that I will need to update. This is what I've come up with so far:
>
> - Update DOMNormalizer to normalize characters if the appropriate
> feature is set.
> - Update DOMParserImpl to allow DOM_NORMALIZE_CHARACTERS and
> DOM_CHECK_CHAR_NORMALIZATION features
> - Add character normalization flags to the normalization section of
> DOMConfigurationImpl.
> - Update AbstractSAXParser to allow the
> UNICODE_NORMALIZATION_CHECKING_FEATURE.

That sounds about right.
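
As a quick way to see what a given DOM implementation currently accepts, the
sketch below asks a DOMConfiguration whether the two DOM Level 3 Core
parameters can be set to true. It assumes the constants above correspond to
the parameter names "normalize-characters" and
"check-character-normalization", and the class and helper names are invented
for illustration. An implementation without this support would be expected
to answer false for both.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of Xerces.
class DomNormalizationParams {
    // DOM Level 3 Core parameter names (assumed to match the constants).
    static final String[] PARAMS = {
        "normalize-characters",
        "check-character-normalization"
    };

    /** Asks the configuration whether the parameter may be set to true. */
    static boolean canEnable(DOMConfiguration config, String name) {
        return config.canSetParameter(name, Boolean.TRUE);
    }

    /** Creates an empty document with the default JAXP DOM implementation. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("could not create a DOM builder", e);
        }
    }

    public static void main(String[] args) {
        DOMConfiguration config = newDocument().getDomConfig();
        for (String name : PARAMS) {
            System.out.println(name + " can be enabled: "
                + canEnable(config, name));
        }
    }
}
```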

> Let me know if you think of any glaring omissions.
>
> The LSSerializer class also has some attributes related to character
> normalization [1].  However I believe the implementation in Xerces is
> actually from the Xalan project which doesn't implement them.  Should
> I be looking at adding character normalization support to their
> project too?

That's not something I'd considered when I proposed the project. I think it
would be a good addition, though probably something to leave until the end
if you still have time (assuming you're interested in working on it).

> Thanks,
> Richard
>
> [1] http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-LSSerializer

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org