Posted to c-dev@xerces.apache.org by "Oliver Moeller (JIRA)" <xe...@xml.apache.org> on 2015/11/02 16:55:27 UTC

[jira] [Created] (XERCESC-2054) Grammar serialization not portable (integer size / alignment issue)

Oliver Moeller created XERCESC-2054:
---------------------------------------

             Summary: Grammar serialization not portable (integer size / alignment issue)
                 Key: XERCESC-2054
                 URL: https://issues.apache.org/jira/browse/XERCESC-2054
             Project: Xerces-C++
          Issue Type: Bug
    Affects Versions: 3.0.2, 3.1.0, 3.1.1, 3.1.2
         Environment: Linux CentOS-7 (64bit), Windows 7 (64bit)
            Reporter: Oliver Moeller


Apologies if this is a known issue, but I have not found it by conventional
means (i.e., Google and searching through the bug database here).


I found that the serialisation/deserialisation (here: of grammars) is not as portable as it (IMHO) should be.

The problem happens in XSerializeEngine::readString() when
the length of the string is taken from the associated BinInputStream as
"unsigned long":
    /***
     * Check if any data written
     ***/
    unsigned long tmp;
    *this>>tmp;

On Windows 7 x64 (MSVS 2012), this will take 4 bytes off the head of the stream,
but on CentOS 7 x64 (g++ 4.8.3), this will take 8 bytes.
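
The difference comes from the data models: 64-bit Windows uses LLP64, where
"long" is 32 bits wide, while 64-bit Linux uses LP64, where it is 64 bits wide.
A minimal sketch that makes this visible; describe_length_field() is a
hypothetical helper for illustration, not part of Xerces-C++:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not Xerces-C++ API): reports how many bytes a length
// field occupies when stored as "unsigned long" versus a fixed-width type.
std::string describe_length_field()
{
    // 4 with MSVC on x64 (LLP64), 8 with g++ on x86-64 Linux (LP64).
    const std::size_t native = sizeof(unsigned long);
    // 4 on every platform that provides uint32_t with 8-bit bytes.
    const std::size_t fixed = sizeof(std::uint32_t);

    return "unsigned long: " + std::to_string(native) +
           " bytes, uint32_t: " + std::to_string(fixed) + " bytes";
}
```

Compiling and calling this on both platforms should report different sizes
for "unsigned long" but the same size for uint32_t.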

As a consequence, a BinInputStream carefully encoded on Windows (e.g. by putting
it into a char array with
  examples/cxx/tree/embedded/grammar-input-stream.cxx,
a common xsd example) will fail when "reading" it on the Linux box, because
everything from the first string on is garbage.

Moreover, this will (probably) give no meaningful error message, just a
thrown "XSerializationException", because at some point it will (probably)
misinterpret wchar data as length information and try to read a next string
that is millions of bytes long (according to the misread BinInputStream).
The BinInputStream will then run out of bytes.

A similar issue concerns the *alignment* of the data according to data type,
which happens for all >> operations: this is (necessarily) very
platform dependent.


It would be a big improvement if Xerces encoded the (de)serialization
in a platform/compiler independent manner. The purpose after all *IS* to be portable, right?

E.g., the serialisation engine could always use integers of known byte width
(e.g.: #include <inttypes.h> -> use uint32_t) instead of "unsigned long".
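
A minimal sketch of what such a fixed-width length field could look like.
write_length() and read_length() are hypothetical helpers, not Xerces-C++ API,
and the least-significant-byte-first order is an assumption of this sketch;
the point is only that exactly four bytes are written and read on every
platform:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not Xerces-C++ API): appends a string length as exactly
// four bytes, least significant byte first, independent of the host platform.
void write_length(std::vector<unsigned char>& out, std::uint32_t len)
{
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<unsigned char>((len >> shift) & 0xFFu));
}

// Hypothetical helper: reads the four bytes written by write_length(),
// starting at the given offset, and rejects a truncated buffer.
std::uint32_t read_length(const std::vector<unsigned char>& in, std::size_t offset)
{
    if (offset > in.size() || in.size() - offset < 4)
        throw std::out_of_range("length field is truncated");

    std::uint32_t len = 0;
    for (int i = 0; i < 4; ++i)
        len |= static_cast<std::uint32_t>(in[offset + i]) << (8 * i);
    return len;
}
```

Because the bytes are assembled with shifts rather than by casting the buffer
pointer, this sketch does not depend on the host byte order or on alignment.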

Also, the alignment issue should be addressed; it is hard to predict
what restrictions apply for the compiler (or even processor) in use: some
are not capable of reading an integer from a memory address that is not
4-byte aligned.
E.g., the data could be copied (to a properly aligned item initialized with 0s)
before doing the cast to an integer type.
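
The copy idea could be sketched as follows. load_u32() is a hypothetical
helper, not Xerces-C++ API; it relies only on std::memcpy, which places no
alignment requirement on its source:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not Xerces-C++ API): reads a 32-bit value from a buffer
// position that may not be 4-byte aligned.
std::uint32_t load_u32(const unsigned char* src)
{
    std::uint32_t value = 0;                 // properly aligned, initialized with 0s
    std::memcpy(&value, src, sizeof value);  // byte-wise copy, src may be unaligned
    return value;                            // bytes are interpreted in host byte order
}
```

With this, the stream no longer needs platform-dependent padding before each
integer; the byte order would still have to be fixed separately.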

In any case, how many bytes are to be read next from the BinInputStream should always be platform-independent.
(Of course, the write operations have to follow the same business logic.)


