Posted to c-dev@xerces.apache.org by "Oliver Moeller (JIRA)" <xe...@xml.apache.org> on 2015/11/02 16:57:27 UTC

[jira] [Updated] (XERCESC-2054) Grammar serialization not portable (integer size / alignment issue)

     [ https://issues.apache.org/jira/browse/XERCESC-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oliver Moeller updated XERCESC-2054:
------------------------------------
    Description: 
Apologies if this is a known issue, but I have not found it by conventional
means (i.e., Google and searching through the bug database here).


I found that the serialisation/deserialisation (here: of grammars) is not as portable as it (IMHO) should be.

The problem happens in XSerializeEngine::readString() when
the length of the string is taken from the associated BinInputStream as
"unsigned long":
    /***
     * Check if any data written
     ***/
    unsigned long tmp;
    *this>>tmp;

On Windows 7 x64 (MSVS2012), this will take 4 bytes off the head of the stream,
but on CentOS 7 x64 (g++ 4.8.3), this will take 8 bytes.

As a consequence, a BinInputStream carefully encoded on Windows (e.g. putting
it into a char array with
  examples/cxx/tree/embedded/grammar-input-stream.cxx
which is a common xsd example)
will fail when "reading" it on the Linux box, because everything from the first
string on is garbage.

Moreover, this will (probably) give no meaningful error message, just a thrown
"XSerializationException", because at some point it will (probably)
misinterpret wchar data as length information and try to read a next string
that is millions of bytes long (according to the misread BinInputStream).
The BinInputStream will then run out of bytes.
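
To make this failure mode concrete, here is a minimal sketch. It is not Xerces
code: readLengthPrefix and makeFourByteLengthStream are hypothetical helpers,
and the stream layout (a little-endian length prefix directly followed by
UTF-16 character data, without padding) is a simplifying assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Xerces): decode the first `width` bytes
// of `stream` as a little-endian unsigned length prefix.
std::uint64_t readLengthPrefix(const std::vector<unsigned char>& stream,
                               std::size_t width)
{
    assert(width <= 8 && width <= stream.size());
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < width; ++i)
        value |= static_cast<std::uint64_t>(stream[i]) << (8 * i);
    return value;
}

// Simplified stream as a writer with a 4-byte "unsigned long" might emit it:
// length 2, followed by the UTF-16LE code units for "ab".
std::vector<unsigned char> makeFourByteLengthStream()
{
    return { 0x02, 0x00, 0x00, 0x00,   // length = 2 (4 bytes)
             0x61, 0x00,               // 'a'
             0x62, 0x00 };             // 'b'
}
```

Decoding the prefix with width 4 yields the intended length 2. Decoding it with
width 8 absorbs the character data into the length and yields
0x0062006100000002, far more than the stream can ever supply.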

A similar issue concerns the *alignment* of the data according to its data
type, which happens for all >> operations: this is (necessarily) very
platform dependent.
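
A rough sketch of why alignment makes things worse, assuming (this is an
assumption, not taken from the Xerces sources) that the engine pads the stream
cursor to a multiple of the size of the value about to be read. alignUp and
offsetAfterOneByteFlag are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round `offset` up to the next multiple of `size`.
std::size_t alignUp(std::size_t offset, std::size_t size)
{
    assert(size != 0);
    return (offset + size - 1) / size * size;
}

// Offset at which a value of `valueSize` bytes would start if it follows a
// single one-byte flag and the cursor is padded to a multiple of valueSize.
std::size_t offsetAfterOneByteFlag(std::size_t valueSize)
{
    const std::size_t cursorAfterFlag = 1;
    return alignUp(cursorAfterFlag, valueSize);
}
```

Under that assumption a 4-byte "unsigned long" after a one-byte flag starts at
offset 4, while an 8-byte one starts at offset 8, so writer and reader disagree
about where the value begins as well as about how wide it is.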


It would be a big improvement if Xerces encoded the (de)serialization
in a platform/compiler-independent manner. The purpose after all *IS* to be portable, right?

For example, the serialisation engine could always use integers of known byte
width (e.g.: #include <inttypes.h> -> use uint32_t) instead of "unsigned long".
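
A minimal sketch of what such a fixed-width encoding could look like. writeU32
and readU32 are hypothetical names, not existing Xerces functions; the sketch
uses <cstdint> (the C++ counterpart of the header named above) and, as an
addition of mine, also fixes the byte order so that endianness does not matter
either.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append `value` as exactly four bytes, least significant byte first.
void writeU32(std::vector<unsigned char>& out, std::uint32_t value)
{
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<unsigned char>((value >> shift) & 0xFFu));
}

// Read four bytes starting at `pos` and advance `pos` by exactly four.
std::uint32_t readU32(const std::vector<unsigned char>& in, std::size_t& pos)
{
    assert(pos + 4 <= in.size());
    std::uint32_t value = 0;
    for (int i = 0; i < 4; ++i)
        value |= static_cast<std::uint32_t>(in[pos + i]) << (8 * i);
    pos += 4;
    return value;
}
```

With this approach every length field occupies four bytes in the stream,
whatever sizeof(unsigned long) happens to be on the writing or reading side.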

Also, the alignment issue should be addressed; it is hard to predict
what restrictions apply for the compiler (or even processor) in use. Some
cannot read an integer from a memory address that is not 4-byte aligned.
E.g., the data could be copied (to a properly aligned item initialized with 0s)
before doing the cast to an integer type.
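
The copy-before-cast idea could be sketched as follows. loadU32 is a
hypothetical helper, not Xerces code; std::memcpy is the usual way to perform
such a copy without ever dereferencing a misaligned pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: load a 32-bit value from an arbitrary (possibly
// unaligned) address by copying it into a zero-initialized, aligned local.
std::uint32_t loadU32(const unsigned char* src)
{
    assert(src != nullptr);
    std::uint32_t value = 0;
    std::memcpy(&value, src, sizeof value);
    return value;
}
```

Note that this keeps the host byte order, so it addresses only the alignment
restriction, not the width or endianness of the encoded value.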

In any case, the number of bytes to be read next from the BinInputStream should always be platform-independent.
(Of course, the write operations have to follow the same logic.)


> Grammar serialization not portable (integer size / alignment issue)
> -------------------------------------------------------------------
>
>                 Key: XERCESC-2054
>                 URL: https://issues.apache.org/jira/browse/XERCESC-2054
>             Project: Xerces-C++
>          Issue Type: Bug
>    Affects Versions: 3.0.2, 3.1.0, 3.1.1, 3.1.2
>         Environment: Linux CentOS-7 (64bit), Windows 7 (64bit)
>            Reporter: Oliver Moeller
>   Original Estimate: 4h
>  Remaining Estimate: 4h



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
