Posted to c-users@xerces.apache.org by Ben Griffin <be...@redsnapper.net> on 2009/03/11 15:48:32 UTC

LoadGrammar Error?

Okay - I've been staring at this for four days now.
Here is a small example of what is bugging me:
-----------------
	class Err: public DOMErrorHandler {
	public:
		// Print each reported message and tell the parser to carry on.
		bool handleError(const xercesc::DOMError& domError) {
			std::cerr << transcode(domError.getMessage());
			return true;
		}
	};

	int main(int argc, char *argv[]) {
		XMLPlatformUtils::Initialize();
		transcoder = XMLPlatformUtils::fgTransService->makeNewLCPTranscoder(XMLPlatformUtils::fgMemoryManager);

		std::string grammar_str =
			"<xs:schema targetNamespace=\"http://my.org/blah\" "
			"xmlns:xs=\"http://www.w3.org/2001/XMLSchema\" >"
			"<xs:attribute name=\"box\" fixed=\"true\" /></xs:schema>";
		XMLCh* grammar_file = transcoder->transcode(grammar_str.c_str());
		Grammar::GrammarType grammar_type = Grammar::SchemaGrammarType;
		DOMImplementation* impl = DOMImplementationRegistry::getDOMImplementation(X("LS"));
		DOMLSParser* parser = ((DOMImplementationLS*)impl)->createLSParser(DOMImplementationLS::MODE_SYNCHRONOUS, 0);
		
		DOMConfiguration* dc = parser->getDomConfig();
		Err* errorHandler = new Err();
		dc->setParameter(XMLUni::fgDOMErrorHandler,errorHandler);
		dc->setParameter(XMLUni::fgXercesUseCachedGrammarInParse, true);		
		dc->setParameter(XMLUni::fgXercesSchema, true);	
		dc->setParameter(XMLUni::fgXercesCacheGrammarFromParse, true);		
		dc->setParameter(XMLUni::fgDOMValidate, true);
		
		DOMLSInput* input = ((DOMImplementationLS*)impl)->createLSInput();
		input->setStringData(grammar_file);
		parser->loadGrammar(input, grammar_type, true);
	
	//	[...]

	}
-----------------------------------------------
An error is being thrown by IGXMLScanner::scanStartTagNS because
fQNameBuf is not being loaded by ReaderMgr::getQName, which in turn is
because isFirstNCNameChar is returning false.

     if (!fReaderMgr.getQName(fQNameBuf, &prefixColonPos)) {
         if (fQNameBuf.isEmpty())
             emitError(XMLErrs::ExpectedElementName); // <-- Error thrown here.
         else


//false being returned by XMLReader::isFirstNCNameChar.
inline bool XMLReader::isFirstNCNameChar(const XMLCh toCheck) const {
     return (((fgCharCharsTable[toCheck] & gFirstNameCharMask) != 0)
             && (toCheck != chColon));
}

The reason is that the schema characters in fCharBuf have been
converted twice (note that this is little-endian).
What follows is the start of a memory dump of fCharBuf:
3c 00 00 00 78 00 00 00 73 00 00 00 3a 00 00 00
73 00 00 00 63 00 00 00 68 00 00 00 65 00 00 00
6d 00 00 00 61 00 00 00 20 00 00 00 74 00 00 00
61 00 00 00 72 00 00 00 67 00 00 00 65 00 00 00
74 00 00 00 4e 00 00 00 61 00 00 00 6d 00 00 00
65 00 00 00 73 00 00 00 70 00 00 00 61 00 00 00
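
The pattern in this dump can be reproduced without Xerces. What follows is only a minimal sketch, assuming 16-bit code units and a decoder that treats every input byte as a character of its own (which is what a UTF-8 decoder does for bytes below 0x80). The names utf16leBytes and widenBytes are hypothetical helpers, not Xerces functions.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Serialise 16-bit code units as little-endian bytes, i.e. the in-memory
// layout of UTF-16 text on a little-endian machine.
std::vector<std::uint8_t> utf16leBytes(const std::u16string& text) {
    std::vector<std::uint8_t> bytes;
    for (char16_t unit : text) {
        bytes.push_back(static_cast<std::uint8_t>(unit & 0xFF));
        bytes.push_back(static_cast<std::uint8_t>(unit >> 8));
    }
    return bytes;
}

// Decode bytes as if each one were a complete character. For bytes below
// 0x80 this matches what a UTF-8 decoder produces.
std::u16string widenBytes(const std::vector<std::uint8_t>& bytes) {
    std::u16string out;
    for (std::uint8_t b : bytes) {
        out.push_back(static_cast<char16_t>(b));
    }
    return out;
}
```

Calling widenBytes(utf16leBytes(u"<xs:")) gives the code units 003C 0000 0078 0000 0073 0000 003A 0000. Stored little-endian, those are the bytes 3c 00 00 00 78 00 00 00 ... seen above, and the unit that follows '<' is U+0000, which is not a valid first name character.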

#0	0x00fe3453 in xercesc_3_0::Wrapper4DOMLSInput::makeStream at Wrapper4DOMLSInput.cpp:132
#1	0x01011e7b in xercesc_3_0::ReaderMgr::createReader at ReaderMgr.cpp:365
#2	0x0100d6f7 in xercesc_3_0::IGXMLScanner::scanReset at IGXMLScanner2.cpp:1362
#3	0x01003c1b in xercesc_3_0::IGXMLScanner::scanDocument at IGXMLScanner.cpp:197
#4	0x0105b587 in xercesc_3_0::AbstractDOMParser::parse at AbstractDOMParser.cpp:535
#5	0x01008845 in xercesc_3_0::IGXMLScanner::loadXMLSchemaGrammar at IGXMLScanner2.cpp:2085
#6	0x00ffee5f in xercesc_3_0::IGXMLScanner::loadGrammar at IGXMLScanner.cpp:3005
#7	0x010616c9 in xercesc_3_0::DOMLSParserImpl::loadGrammar at DOMLSParserImpl.cpp:935

//So here we see the culprit -
BinInputStream* Wrapper4DOMLSInput::makeStream() const {
     // The LSParser will use the LSInput object to determine how to read data.
     // The LSParser will look at the different inputs specified in the
     // LSInput in the following order to know which one to read from,
     // the first one that is not null and not an empty string will be used:
     //   1. LSInput.characterStream
     //   2. LSInput.byteStream
     //   3. LSInput.stringData
     //   4. LSInput.systemId
     //   5. LSInput.publicId

     InputSource* binStream=fInputSource->getByteStream();
     if(binStream)
         return binStream->makeStream();
     const XMLCh* xmlString=fInputSource->getStringData();
     if(xmlString)
     {
         MemBufInputSource is((const XMLByte*)xmlString,
                              XMLString::stringLen(xmlString)*sizeof(XMLCh), "", false,
                              getMemoryManager()); // <--!!!! what?!
         is.setCopyBufToStream(false);
         return is.makeStream();
     }
-----------------------------------------------

First of all, the fact that this function looks at the byteStream first
MUST be a bug.
Secondly, the characterStream is being CONVERTED, when it should
already be an XMLCh* (as defined everywhere else).


Or am I missing a trick?

Re: LoadGrammar Error?

Posted by Alberto Massari <am...@datadirect.com>.

Hi Ben,
the cast in the MemBufInputSource is fine, as it is simply a wrapper for 
a bunch of bytes, regardless of which encoding they are using. The only 
change that could be made to handle your case (a missing XML header in 
the string) is to add the call

is.setEncoding(XMLUni::fgXMLChEncodingString);

after the creation of the object.
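
To illustrate why declaring the encoding matters, here is a toy model, not Xerces code. ToySource and decode are hypothetical names, and the sketch assumes a little-endian platform where the XMLCh encoding is UTF-16LE. The same four bytes decode differently depending on whether the source carries an encoding name.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in for an in-memory input source: raw bytes plus an optional
// caller-supplied encoding name (empty means "not set, let the reader guess").
struct ToySource {
    std::vector<std::uint8_t> bytes;
    std::string encoding;
};

// Decode the source into 16-bit code units. With "UTF-16LE" declared, byte
// pairs are combined into one unit; otherwise the toy reader falls back to
// one unit per byte, which is what a UTF-8 guess does for bytes below 0x80.
std::u16string decode(const ToySource& src) {
    std::u16string out;
    if (src.encoding == "UTF-16LE") {
        for (std::size_t i = 0; i + 1 < src.bytes.size(); i += 2) {
            out.push_back(static_cast<char16_t>(
                src.bytes[i] | (src.bytes[i + 1] << 8)));
        }
    } else {
        for (std::uint8_t b : src.bytes) {
            out.push_back(static_cast<char16_t>(b));
        }
    }
    return out;
}
```

With the bytes 3c 00 78 00 and no encoding set, decode returns four units with a NUL after '<'. With "UTF-16LE" set it returns the two units of "<x", which is the effect the setEncoding call is meant to have on the real input source.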

Alberto

Re: LoadGrammar Error?

Posted by Ben Griffin <be...@redsnapper.net>.

Alberto, thanks for your time.

On 11 Mar 2009, at 15:46, Alberto Massari wrote:
> Hi Ben,
> 1) why do you think that Wrapper4LSInput shouldn't look at the  
> byteStream? The specs list this order

Okay - I see that there is no  LSInput.characterStream, which is (sort  
of) fair enough, so I agree that the order is therefore correct.
>
> 2) the stringData is not being converted: MemBufInputSource works on  
> a byte stream, so it needs a cast and a size computed by multiplying  
> sizeof(XMLCh) by the length (in UTF-16 chars) of the string.

Well, here I have to disagree. Look at the (fragment of ) makeStream  
below:

			BinInputStream* Wrapper4DOMLSInput::makeStream() const {
			    // The LSParser will use the LSInput object to determine how to read data.
			    // The LSParser will look at the different inputs specified in the
			    // LSInput in the following order to know which one to read from,
			    // the first one that is not null and not an empty string will be used:
			    //   1. LSInput.characterStream
			    //   2. LSInput.byteStream
			    //   3. LSInput.stringData
			    //   4. LSInput.systemId
			    //   5. LSInput.publicId
			    InputSource* binStream=fInputSource->getByteStream();
			    if(binStream)
			        return binStream->makeStream();
--->			    const XMLCh* xmlString=fInputSource->getStringData();
// xmlString is a XMLCh*, as created using LSInput->setStringData()

			    if(xmlString)
			    {

--->			        MemBufInputSource is((const XMLByte*)xmlString, XMLString::stringLen(xmlString)*sizeof(XMLCh), "", false, getMemoryManager());
// So why is it being CAST into XMLByte here?
// And now "is" is being instantiated as if the xmlString is a XMLByte* ....

			        is.setCopyBufToStream(false);
			        return is.makeStream();

//...which makes a  BinInputStream* from  "is"

Now, THAT goes on to instantiate an XMLReader, which does an initial load
of raw bytes.
     refreshRawBuffer();

and then uses.. an XMLRecognizer to test the Encoding.. HANG ON -
this is meant to be XMLCh...
... anyway... That should be FINE if it returns the same encoding as an
XMLCh.

So being a XMLCh* - the grammar starts (in terms of bytes)  3c 00

XMLRecognizer::basicEncodingProbe(const XMLByte* const rawBuffer, const XMLSize_t rawByteCount)

Because this doesn't actually know about non-BOM UTF-16BE or UTF-16LE
(ie, the XMLCh encoding), it is going to return "UTF-8".

Likewise, because the grammar string does not have an <?xml ..> declaration
(which is legal), the XMLRecognizer is going to fail.

As you can imagine, once the BinInputStream has been identified as  
UTF-8, there really is no turning back.
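
The behaviour described above can be sketched with a simplified probe. This is NOT the Xerces implementation of basicEncodingProbe; probeEncoding is a hypothetical function that only follows the byte patterns listed in Appendix F of the XML 1.0 specification for the UTF-8 and UTF-16 cases (UTF-32 and EBCDIC are left out on purpose).

```cpp
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <string>
#include <vector>

// Guess an encoding from the first raw bytes of a document.
std::string probeEncoding(const std::vector<std::uint8_t>& raw) {
    // True when the buffer begins with exactly the given byte sequence.
    auto startsWith = [&raw](std::initializer_list<std::uint8_t> prefix) {
        if (raw.size() < prefix.size()) return false;
        std::size_t i = 0;
        for (std::uint8_t b : prefix) {
            if (raw[i++] != b) return false;
        }
        return true;
    };

    if (startsWith({0xEF, 0xBB, 0xBF})) return "UTF-8";           // UTF-8 BOM
    if (startsWith({0xFF, 0xFE})) return "UTF-16LE";              // UTF-16 BOM
    if (startsWith({0xFE, 0xFF})) return "UTF-16BE";              // UTF-16 BOM
    if (startsWith({0x3C, 0x00, 0x3F, 0x00})) return "UTF-16LE";  // "<?" without BOM
    if (startsWith({0x00, 0x3C, 0x00, 0x3F})) return "UTF-16BE";  // "<?" without BOM
    return "UTF-8";  // default when nothing matches
}
```

In this sketch a buffer starting 3c 00 3f 00 (the "<?" of an XML declaration in UTF-16LE) is recognised, while 3c 00 78 00 (the "<x" of "<xs:schema") matches nothing and falls through to the UTF-8 default.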

Sure enough, now AbstractDOMParser::startDocument() calls
fDocument->setInputEncoding(fScanner->getReaderMgr()->getCurrentEncodingStr());

Just in time for
IGXMLScanner::scanDocument(const InputSource& src) to call    
scanStartTagNS(gotData)

This then hits trouble at (!fReaderMgr.getQName(fQNameBuf,
&prefixColonPos)), which returns with fQNameBuf empty,
and the empty buffer will emit an Error.

>
> As for the error you see, are you sure your
> transcoder->transcode(grammar_str.c_str()) is actually generating a
> string of XMLCh? Could you post its code?

My transcoder?

XMLLCPTranscoder* transcoder = XMLPlatformUtils::fgTransService->makeNewLCPTranscoder(XMLPlatformUtils::fgMemoryManager);


Best regards
	Ben.

Re: LoadGrammar Error?

Posted by Alberto Massari <am...@datadirect.com>.

Hi Ben,
1) why do you think that Wrapper4LSInput shouldn't look at the 
byteStream? The specs list this order

   1. LSInput.characterStream
      <http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-LSInput-characterStream>
   2. LSInput.byteStream
      <http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-LSInput-byteStream>
   3. LSInput.stringData
      <http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-LSInput-stringData>
   4. LSInput.systemId
      <http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-LSInput-systemId>
   5. LSInput.publicId
      <http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-LSInput-publicId>

and the first item, characterStream (of type LSReader) is not available 
in Xerces-C++, as allowed by the specs (LSReader is an Object, so its 
purpose is to allow the use of java.lang.String).

2) the stringData is not being converted: MemBufInputSource works on a 
byte stream, so it needs a cast and a size computed by multiplying 
sizeof(XMLCh) by the length (in UTF-16 chars) of the string.
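
The size computation can be shown in isolation. This is a minimal sketch, assuming XMLCh is a 16-bit unit and modelling it with char16_t; rawBytes is a hypothetical helper, not part of Xerces.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Copy the in-memory bytes of a 16-bit string, the way a byte-oriented
// input source would see them. The byte count is the length in code units
// multiplied by sizeof(char16_t); the terminating NUL is not included.
std::vector<std::uint8_t> rawBytes(const std::u16string& text) {
    const std::size_t byteCount = text.size() * sizeof(char16_t);
    std::vector<std::uint8_t> bytes(byteCount);
    if (byteCount != 0) {
        std::memcpy(bytes.data(), text.data(), byteCount);
    }
    return bytes;
}
```

For u"<xs" this yields 6 bytes. Their order inside each pair depends on the platform: 3c 00 on a little-endian machine, 00 3c on a big-endian one. Nothing is converted; the bytes are only viewed through a different pointer type.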

As for the error you see, are you sure your 
transcoder->transcode(grammar_str.c_str()) is actually generating a 
string of XMLCh? Could you post its code?

Alberto
