Posted to dev@daffodil.apache.org by Steve Lawrence <sl...@apache.org> on 2021/01/08 14:18:03 UTC

Timezones in DFDL

I was confirming that DAFFODIL-1580 [1] is still an issue and was going
to open a bug with ICU. As I look more at this, though, I think it is
just a limitation of timezones in DFDL, and I wanted confirmation first.

For example, we have a test schema that looks like this:

 <xs:element name="time" type="xs:dateTime"
   dfdl:calendarPattern="hh:mm.VVVV" ... />

And matching data that looks like this:

  8:43.Los Angeles Time

This parses to an infoset that looks like this:

  <time>08:43:00-08:00</time>

And that infoset unparses to this:

  08:43.GMT-08:00

Note that the unparsed timezone does not match the original data.
DAFFODIL-1580 describes this behavior as a bug (either in Daffodil or
ICU) but I think this is actually expected behavior. A DFDL infoset does
not contain any location-specific timezone information--it only contains
a GMT offset (a restriction of XML Schema). So this data will always
unparse to a non-location-specific timezone, depending on the calendar
pattern. For some patterns this will be an offset or a generic timezone
like PST (which should both roundtrip fine), but others might result in
"Unknown" or "unk". I think this only affects the "V" and "v" calendar
patterns, but additional tests should be added to confirm this behavior.
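
As a rough illustration of why the location cannot survive the round trip, here is a minimal Python sketch (not Daffodil or ICU code). It assumes the infoset value can be read as an ISO 8601 string with a placeholder date, and unparse_offset is a hypothetical helper that only mimics the GMT-offset output shown above.

```python
from datetime import datetime, timedelta

def unparse_offset(value: datetime) -> str:
    """Hypothetical helper: render hh:mm plus a GMT offset, like the output above."""
    total_minutes = int(value.utcoffset().total_seconds() // 60)
    sign = "-" if total_minutes < 0 else "+"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{value:%H:%M}.GMT{sign}{hours:02d}:{minutes:02d}"

# The date part is only a placeholder so the string parses as a full dateTime.
infoset_value = datetime.fromisoformat("1970-01-01T08:43:00-08:00")

# All that is left of "Los Angeles Time" is a fixed offset of -8 hours.
print(infoset_value.utcoffset() == timedelta(hours=-8))
print(unparse_offset(infoset_value))
```

Nothing in the parsed value names a location, so any unparse step working from it can only emit an offset-based form.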

This is the expected behavior, correct?


[1] https://issues.apache.org/jira/browse/DAFFODIL-1580

Re: Timezones in DFDL

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.
DFDL Spec says:

When parsing, for any pattern that omits components the values for the omitted components are supplied from the Unix epoch 1970-01-01T00:00:00.000. <https://opengridforum.github.io/DFDL/working-drafts/gwdrp-dfdl-v1.0.5-r35.htm#_ftn44>

So if a date is needed to determine whether DST applies, it's 1970-01-01.

However, that's still not enough. You need a location to determine the time zone.

Southern hemisphere countries that observe DST do so during the northern hemisphere winter.

So whether DST applies on Jan 01 1970 depends on whether you are in Canada or Chile.

Hence, time zone specifiers that contain a location carry more information than they do once they are normalized to just time-zone offsets from UTC.
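
To make the point concrete, here is a toy Python sketch. The ToyZone class, the two example zones, and their month sets are all hypothetical simplifications invented for illustration; real DST rules come from the tz database and are far more detailed.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class ToyZone:
    """Hypothetical, simplified zone; not real tz database rules."""
    name: str
    standard: timedelta
    dst_months: frozenset  # months (1-12) roughly treated as daylight time

    def offset_on(self, day: date) -> timedelta:
        # Add one hour while the (approximate) DST season is in effect.
        extra = timedelta(hours=1) if day.month in self.dst_months else timedelta(0)
        return self.standard + extra

# Illustrative zones only: DST mid-year in the north, around year-end in the south.
NORTHERN = ToyZone("northern example", timedelta(hours=-5), frozenset({4, 5, 6, 7, 8, 9, 10}))
SOUTHERN = ToyZone("southern example", timedelta(hours=-4), frozenset({11, 12, 1, 2, 3}))

EPOCH = date(1970, 1, 1)
JULY = date(1970, 7, 1)

for zone in (NORTHERN, SOUTHERN):
    print(zone.name, zone.offset_on(EPOCH), zone.offset_on(JULY))
```

In this toy model the two zones differ on the epoch date but share the same offset in July, so neither the date alone nor the offset alone identifies the zone.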



________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Friday, January 8, 2021 11:06 AM
To: dev@daffodil.apache.org <de...@daffodil.apache.org>
Subject: Re: Timezones in DFDL

That's a good point. I'm not sure if DFDL describes how this should be
handled. I think for the most part, DFDL just says the correct behavior
when dealing with dates/times is whatever ICU does. It seems if ICU is
not given a timezone, then it just assumes the offset is the
standard timezone and not the daylight timezone, so maybe that's the
expected behavior.

DFDL does describe a "calendarObserveDST" property (which I notice
Daffodil doesn't implement; we need to implement that or add it to our
unsupported features page), though based on the brief description I'm
not sure if that property applies in this case.

Mike, any insight on what the spec says/implies about time zones without
dates? I don't see anything obvious in a quick scan.



Re: Timezones in DFDL

Posted by Steve Lawrence <sl...@apache.org>.
That's a good point. I'm not sure if DFDL describes how this should be
handled. I think for the most part, DFDL just says the correct behavior
when dealing with dates/times is whatever ICU does. It seems if ICU is
not given a timezone, then it just assumes the offset is the
standard timezone and not the daylight timezone, so maybe that's the
expected behavior.

DFDL does describe a "calendarObserveDST" property (which I notice
Daffodil doesn't implement; we need to implement that or add it to our
unsupported features page), though based on the brief description I'm
not sure if that property applies in this case.

Mike, any insight on what the spec says/implies about time zones without
dates? I don't see anything obvious in a quick scan.

On 1/8/21 10:34 AM, Dave Fisher wrote:
> There is a deeper problem with this example. It is a dateTime without a date, so it loses the distinction between when Los Angeles is GMT-7 and when it is GMT-8.


Re: Timezones in DFDL

Posted by Dave Fisher <wa...@apache.org>.
There is a deeper problem with this example. It is a dateTime without a date, so it loses the distinction between when Los Angeles is GMT-7 and when it is GMT-8.


> On Jan 8, 2021, at 6:48 AM, Beckerle, Mike <mb...@owlcyberdefense.com> wrote:
> 
> I believe SL is correct. This is as expected. This is data canonicalization, which is typically what happens when a parser tolerates many diverse formats but the data format doesn't capture which of those was actually used. They're considered, by the DFDL schema, to be 100% equivalent. The output when unparsing is then the canonical representation of that information.
> 
> In general to deal with this we use roundTrip="twoPass" tests in TDML. You probably need to change some one-pass tests to two-pass.
> 
> That way it parses, unparses (to the canonical representation), then parses again and compares infosets. On that second parse, it will get the same infoset from 08:43.GMT-08:00 as it did from 8:43.Los Angeles Time, so the test will pass.


Re: Timezones in DFDL

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.
I believe SL is correct. This is as expected. This is data canonicalization, which is typically what happens when a parser tolerates many diverse formats but the data format doesn't capture which of those was actually used. They're considered, by the DFDL schema, to be 100% equivalent. The output when unparsing is then the canonical representation of that information.

In general to deal with this we use roundTrip="twoPass" tests in TDML. You probably need to change some one-pass tests to two-pass.

That way it parses, unparses (to the canonical representation), then parses again and compares infosets. On that second parse, it will get the same infoset from 08:43.GMT-08:00 as it did from 8:43.Los Angeles Time, so the test will pass.
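
The two-pass idea can be sketched in a few lines of Python. This is only a model of the comparison logic, not TDML or Daffodil code: parse, unparse, two_pass, and the one-entry ZONE_NAME_OFFSETS table are hypothetical stand-ins for what Daffodil and ICU actually do.

```python
from datetime import timedelta

# Hypothetical stand-in for ICU's zone-name lookup; one entry is enough here.
ZONE_NAME_OFFSETS = {"Los Angeles Time": timedelta(hours=-8)}

def format_offset(offset: timedelta) -> str:
    """Render an offset as +HH:MM or -HH:MM."""
    total_minutes = int(offset.total_seconds() // 60)
    sign = "-" if total_minutes < 0 else "+"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{sign}{hours:02d}:{minutes:02d}"

def parse(text: str) -> str:
    """Turn 'h:mm.<zone>' into an infoset-style 'HH:MM:00+HH:MM' string."""
    clock, zone = text.split(".", 1)
    hour, minute = (int(part) for part in clock.split(":"))
    if zone.startswith("GMT"):
        sign = -1 if zone[3] == "-" else 1
        zone_hours, zone_minutes = (int(part) for part in zone[4:].split(":"))
        offset = sign * timedelta(hours=zone_hours, minutes=zone_minutes)
    else:
        offset = ZONE_NAME_OFFSETS[zone]
    return f"{hour:02d}:{minute:02d}:00{format_offset(offset)}"

def unparse(infoset: str) -> str:
    """Emit the canonical form; only the offset is available to write out."""
    return f"{infoset[:5]}.GMT{infoset[8:]}"

def two_pass(original: str):
    """Parse, unparse, parse again, and compare the two infosets."""
    first_infoset = parse(original)
    canonical = unparse(first_infoset)
    second_infoset = parse(canonical)
    return canonical, first_infoset == second_infoset

canonical, infosets_match = two_pass("8:43.Los Angeles Time")
print(canonical)
print(canonical == "8:43.Los Angeles Time", infosets_match)
```

A one-pass comparison of the unparsed text against the original data would fail here, while the infoset comparison after the second parse succeeds.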


________________________________
From: Larry Barber <la...@nteligen.com>
Sent: Friday, January 8, 2021 9:36 AM
To: dev@daffodil.apache.org <de...@daffodil.apache.org>
Subject: RE: Timezones in DFDL

This reminds me of the case where there are multiple possible delimiters - the one provided in the original file may not be the one that appears in the unparse output.


RE: Timezones in DFDL

Posted by Larry Barber <la...@nteligen.com>.
This reminds me of the case where there are multiple possible delimiters - the one provided in the original file may not be the one that appears in the unparse output.
