Posted to dev@plc4x.apache.org by Christofer Dutz <ch...@c-ware.de> on 2022/11/07 13:59:53 UTC

[DISCUSS] Handling string fields

Hi all,

I’m currently having a bit of a discussion with Sebastian. I would like to hear your opinion on this.

I recently updated the reader and writer to handle strings differently.

Previously we simply gobbled up all bytes and converted them into a string. As a result, the desired string sat at the beginning, followed by a 0x00 character, which usually cannot be displayed, and then sometimes by garbage characters (readable ones, if you’re lucky).

I made the string reader terminate the string as soon as it comes across the 0x00 byte (NULL-Byte).
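
To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is not the PLC4X reader; the function names and the sample buffer are made up for illustration, and it assumes a single-byte or UTF-8 encoding where a lone 0x00 byte is the terminator.

```python
def read_all_bytes(buf: bytes, encoding: str = "utf-8") -> str:
    """Old behaviour: decode every byte of the field, 0x00 and leftovers included."""
    return buf.decode(encoding, errors="replace")


def read_null_terminated(buf: bytes, encoding: str = "utf-8") -> str:
    """New behaviour: stop at the first 0x00 byte, if there is one."""
    end = buf.find(b"\x00")
    if end == -1:
        end = len(buf)
    return buf[:end].decode(encoding, errors="replace")


# Hypothetical 10-byte STRING field: payload, terminator, leftover bytes.
field = b"PUMP\x00xy\x00\x00\x00"
print(repr(read_all_bytes(field)))        # 'PUMP\x00xy\x00\x00\x00'
print(repr(read_null_terminated(field)))  # 'PUMP'
```

A multi-byte encoding such as UTF-16 would need a two-byte terminator check instead of `find(b"\x00")`.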

Now Sebastian and I are arguing a bit for and against this. He thinks I broke the RandomPackets tests (which I obviously did, and we’re currently trying to figure out why the test didn’t fail on the CI … or rather, why it didn’t run at all).

I guess one option is to bring our mspec options into play and simply define an option where we decide how we want string encoding handled. The question, however, is: which should be the default encoding?

I would like to argue for 0-terminated strings being the default as this is what I have seen being used in every protocol I have come across.

Sebastian argues that simply interpreting all bytes of the string as a string should be the default.


Now’s the time to ask what you folks think.

Chris

Re: [DISCUSS] Handling string fields

Posted by Christofer Dutz <ch...@c-ware.de>.

So as a little experiment I added a system property that the RandomPackagesTest can set, which disables the 0-termination of strings, and with this it seems to be generally working again.

(This is all in my PR branch)

Chris

From: Christofer Dutz <ch...@c-ware.de>
Date: Tuesday, 8 November 2022 at 09:10
To: dev@plc4x.apache.org <de...@plc4x.apache.org>
Subject: Re: [DISCUSS] Handling string fields
OK, so I also took this question to LinkedIn in the hope of hearing more about how PLCs really use strings:
https://www.linkedin.com/feed/update/urn%3Ali%3Aactivity%3A6995486785375981568/

Admittedly there aren’t many replies yet, as I just posted it yesterday evening.

At least one person seems to confirm my gut feeling that when writing to a STRING field most PLCs simply write the string, add the terminating 0x00 if there’s space left, and leave the rest unchanged. So there’s simply old junk after the 0x00.

Let’s see if there’s more input to come.

Right now I think one way we could have both working (a clean string and Sebastian’s pcap-based tests) would be to set an environment variable that turns off the string termination for use in tests, as this seems to be the only sensible use case for wanting to keep the junk (if it is junk).

Chris

From: Sebastian Rühl <sr...@apache.org>
Date: Monday, 7 November 2022 at 15:14
To: dev@plc4x.apache.org <de...@plc4x.apache.org>
Subject: Re: [DISCUSS] Handling string fields
To give a little context:

In BACnet I have roundtrip tests implemented which take some bytes, parse them, and then serialize them again. Here we are losing data because the 0x0 is just omitted and then never written again.

IMHO null-terminated strings are something special (https://en.wikipedia.org/wiki/Null-terminated_string) and not related to the encoding itself (see the section on character encoding on that page). For example, Java uses a special version of UTF-8 (https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8) which re-encodes 0 so as not to confuse null-terminated readers, for backward compatibility. ASCII and UTF-8 themselves are perfectly fine with 0x00 bytes in a string.
Also, in my opinion the fact that 0-terminated strings are used belongs to the protocol-specific configuration and should be declared where strings are used or at the top level.

Coming back to BACnet for example: I know the encoding and the length. And if it says it is UTF-8 (0x0), for example, with a length of 6, I want to read a string with a length of 6 and not randomly only 5 because some random vendor is using a null-terminated string. So I want to avoid losing data.

As I said for me it should be part of the protocol spec and not low level hard coded into the ByteBased readers/writers.

Sebastian


Re: [DISCUSS] Handling string fields

Posted by Christofer Dutz <ch...@c-ware.de>.

OK, so I also took this question to LinkedIn in the hope of hearing more about how PLCs really use strings:
https://www.linkedin.com/feed/update/urn%3Ali%3Aactivity%3A6995486785375981568/

Admittedly there aren’t many replies yet, as I just posted it yesterday evening.

At least one person seems to confirm my gut feeling that when writing to a STRING field most PLCs simply write the string, add the terminating 0x00 if there’s space left, and leave the rest unchanged. So there’s simply old junk after the 0x00.
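
The write behaviour described here can be simulated in a few lines of Python. This is only a sketch of the reported pattern, not code for any particular PLC; the field size and values are invented for the example.

```python
def write_string(field: bytearray, value: str) -> None:
    """Overwrite the start of a fixed-size field; add 0x00 only if there is room."""
    data = value.encode("ascii")
    if len(data) > len(field):
        raise ValueError("value does not fit into the field")
    field[:len(data)] = data
    if len(data) < len(field):
        field[len(data)] = 0x00
    # Bytes after the terminator are deliberately left untouched.


field = bytearray(8)            # hypothetical 8-byte STRING field, all zeros
write_string(field, "STOPPED")  # 7 characters plus terminator fill the field
write_string(field, "RUN")      # shorter value: the tail of the old one survives
print(bytes(field))             # b'RUN\x00PED\x00'
```

A reader that decodes all eight bytes would return "RUN", an unprintable 0x00, and then "PED" from the previous value.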

Let’s see if there’s more input to come.

Right now I think one way we could have both working (a clean string and Sebastian’s pcap-based tests) would be to set an environment variable that turns off the string termination for use in tests, as this seems to be the only sensible use case for wanting to keep the junk (if it is junk).
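
A switch of this kind could look roughly like the following Python sketch. The variable name `KEEP_STRING_PADDING` is purely hypothetical (the thread does not name the real switch), and the reader is a stand-in, not the PLC4X implementation.

```python
import os

# Hypothetical variable name, chosen only for this example.
ENV_VAR = "KEEP_STRING_PADDING"


def read_string(buf: bytes, env=os.environ) -> str:
    """Cut at the first 0x00 unless the (assumed) test switch is set to "true"."""
    keep_all = env.get(ENV_VAR, "false").lower() == "true"
    if not keep_all:
        end = buf.find(b"\x00")
        if end != -1:
            buf = buf[:end]
    return buf.decode("utf-8", errors="replace")


print(repr(read_string(b"AB\x00zz", env={})))                 # 'AB'
print(repr(read_string(b"AB\x00zz", env={ENV_VAR: "true"})))  # 'AB\x00zz'
```

Passing the environment as a parameter keeps the sketch testable without touching the real process environment.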

Chris


Re: [DISCUSS] Handling string fields

Posted by Sebastian Rühl <sr...@apache.org>.

To give a little context:

In BACnet I have roundtrip tests implemented which take some bytes, parse them, and then serialize them again. Here we are losing data because the 0x0 is just omitted and then never written again.
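
The data loss in such a roundtrip can be reproduced with a small Python sketch. The `parse` and `serialize` functions and the sample payload are illustrative assumptions, not the BACnet test code.

```python
def parse(buf: bytes) -> str:
    """Reader that silently cuts at the first 0x00 (the behaviour under discussion)."""
    end = buf.find(b"\x00")
    return (buf if end == -1 else buf[:end]).decode("utf-8")


def serialize(text: str) -> bytes:
    """Writer that emits exactly the characters it was given."""
    return text.encode("utf-8")


captured = b"Room1\x00"               # hypothetical 6-byte payload from a capture
roundtrip = serialize(parse(captured))
print(captured == roundtrip)          # False: the trailing 0x00 is gone
```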

IMHO null-terminated strings are something special (https://en.wikipedia.org/wiki/Null-terminated_string) and not related to the encoding itself (see the section on character encoding on that page). For example, Java uses a special version of UTF-8 (https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8) which re-encodes 0 so as not to confuse null-terminated readers, for backward compatibility. ASCII and UTF-8 themselves are perfectly fine with 0x00 bytes in a string.
Also, in my opinion the fact that 0-terminated strings are used belongs to the protocol-specific configuration and should be declared where strings are used or at the top level.
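
The encoding point can be illustrated with a short Python sketch. Standard UTF-8 keeps U+0000 as a single 0x00 byte, while Modified UTF-8 writes it as the two bytes 0xC0 0x80. The helper below is a simplification made up for this example: it only handles the NUL case and ignores the other difference in Modified UTF-8 (supplementary characters encoded as surrogate pairs).

```python
def encode_modified_utf8_nul_only(text: str) -> bytes:
    """Standard UTF-8, except that U+0000 becomes the two-byte form 0xC0 0x80."""
    # In UTF-8 a 0x00 byte can only come from U+0000, so a byte-level replace is safe.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


text = "A\x00B"
standard = text.encode("utf-8")
modified = encode_modified_utf8_nul_only(text)
print(standard)   # b'A\x00B'
print(modified)   # b'A\xc0\x80B'
```

The standard form contains an embedded 0x00 that a null-terminated reader would stop at; the modified form does not.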

Coming back to BACnet for example: I know the encoding and the length. And if it says it is UTF-8 (0x0), for example, with a length of 6, I want to read a string with a length of 6 and not randomly only 5 because some random vendor is using a null-terminated string. So I want to avoid losing data.
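
A length-driven read of this kind might be sketched as follows in Python. The frame layout (one length byte, one character-set byte, then the payload) is a simplified, hypothetical stand-in and not the actual BACnet encoding.

```python
def read_length_prefixed(buf: bytes) -> str:
    """Read <length><charset><payload>; simplified, hypothetical layout."""
    length, charset = buf[0], buf[1]
    if charset != 0x00:
        raise ValueError("only charset 0x00 (treated as UTF-8 here) is handled")
    payload = buf[2:2 + length]
    if len(payload) != length:
        raise ValueError("buffer shorter than declared length")
    # The declared length decides where the string ends, not a 0x00 byte.
    return payload.decode("utf-8")


frame = bytes([6, 0x00]) + b"Room1\x00"
text = read_length_prefixed(frame)
print(len(text))  # 6
```

Whether the trailing 0x00 is then stripped becomes a separate, protocol-level decision rather than something the byte reader does implicitly.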

As I said, for me it should be part of the protocol spec and not hard-coded at a low level into the ByteBased readers/writers.

Sebastian 
