Posted to commits@daffodil.apache.org by "Steve Lawrence (Jira)" <ji...@apache.org> on 2023/06/29 15:48:00 UTC

[jira] [Commented] (DAFFODIL-2823) time zone "z" specifier does not work properly

    [ https://issues.apache.org/jira/browse/DAFFODIL-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738626#comment-17738626 ] 

Steve Lawrence commented on DAFFODIL-2823:
------------------------------------------

I've confirmed that we are setting the locale correctly on the calendar, so I don't think that is the issue.

TLDR: I think it is somewhat complex to unparse to a locale-specific timezone. IBM DFDL has the same behavior, so I suggest we close this as expected behavior.
----
XML schema only allows timezone offsets (e.g. GMT-05:00), so when unparsing, our calendar is just set with an offset timezone and ICU doesn't localize that. That's possibly a bug in ICU, but it's also possible it's intended behavior. I think the reason it may be intended is that it's actually pretty difficult to figure out the right timezone (timezones are really complicated).

If we wanted to fix this ourselves since ICU doesn't do it, we would need to use the offset to find the "best" timezone and set that on the calendar, and then things would unparse as this issue expects. The best algorithm I've come up with so far to convert a GMT offset to the "best" timezone is below.

There are two ICU4J APIs to get all TimeZones with a given offset, one with and one without a country code:
{code:java}
static Set<String> getAvailableIDs(TimeZone.SystemTimeZoneType zoneType, String region, Integer rawOffset)
static String[] getAvailableIDs(int rawOffset)
{code}
The problem is that these return a bunch of different possibilities. For example, an offset of -18000000 (GMT-5 in milliseconds) with a region of "US" gives these possibilities:
{code:scala}
scala> val ids = TimeZone.getAvailableIDs(TimeZone.SystemTimeZoneType.CANONICAL, "US", -18000000).asScala
ids: scala.collection.mutable.Set[String] = Set(
America/Detroit,
America/Indiana/Marengo,
America/Indiana/Petersburg,
America/Indiana/Vevay,
America/Indiana/Vincennes,
America/Indiana/Winamac,
America/Indianapolis,
America/Kentucky/Monticello,
America/Louisville,
America/New_York)
{code}
Picking the "right" one is not obvious, and as far as I can find, ICU doesn't readily provide that information.

However, I've found that you can get the "meta zone IDs" for each time zone, each of which has a "reference" or "golden" ID associated with it, and the golden zone is probably a reasonable "right" timezone. For example:
{code:scala}
scala> val tzn = TimeZoneNames.getInstance(new ULocale("en_US"))
scala> val metaMap = ids.map { id => id -> tzn.getAvailableMetaZoneIDs(id).asScala }.toMap
metaMap: scala.collection.immutable.Map[String,scala.collection.mutable.Set[String]] = Map(
America/Indianapolis -> Set(America_Eastern),
America/New_York -> Set(America_Eastern),
America/Detroit -> Set(America_Eastern),
America/Indiana/Winamac -> Set(America_Eastern, America_Central),
America/Indiana/Vevay -> Set(America_Eastern),
America/Indiana/Petersburg -> Set(America_Eastern, America_Central),
America/Kentucky/Monticello -> Set(America_Eastern, America_Central),
America/Louisville -> Set(America_Eastern, America_Central),
America/Indiana/Vincennes -> Set(America_Eastern, America_Central),
America/Indiana/Marengo -> Set(America_Eastern, America_Central))
{code}
Note that every meta zone is "America_Eastern", "America_Central", or both. Some of these time zones have switched between meta zones over time, so a -5 offset could be either Eastern or Central depending on the date. If we convert all the meta zones to their "reference" or "golden" zone, we get closer to a "best" timezone.
{code:scala}
scala> val goldenMap = metaMap.mapValues { metas => metas.map { meta => tzn.getReferenceZoneID(meta, "US") }}
goldenMap: scala.collection.immutable.Map[String,scala.collection.mutable.Set[String]] = Map(
America/Indianapolis -> Set(America/New_York),
America/New_York -> Set(America/New_York),
America/Detroit -> Set(America/New_York),
America/Indiana/Winamac -> Set(America/New_York, America/Chicago),
America/Indiana/Vevay -> Set(America/New_York),
America/Indiana/Petersburg -> Set(America/New_York, America/Chicago),
America/Kentucky/Monticello -> Set(America/New_York, America/Chicago),
America/Louisville -> Set(America/New_York, America/Chicago),
America/Indiana/Vincennes -> Set(America/New_York, America/Chicago),
America/Indiana/Marengo -> Set(America/New_York, America/Chicago))
{code}
So now for each timezone that has a -5 offset, we have the possible reference/golden zones, either America/New_York or America/Chicago. Maybe the "best" is the time zone that is its own reference zone, so we can filter like this:
{code:scala}
scala> val timeZone = goldenMap.filter { case (key, values) => values.size == 1 && key == values.head }.keys
timeZone: Iterable[String] = Set(America/New_York)
{code}
We now have a single timezone of "America/New_York", which is probably a reasonable time zone to pick for a -5 offset in the US locale.
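
The steps above could be collected into one helper, roughly as follows. This is only a sketch: goldenZonesFor is a hypothetical name (it is not part of Daffodil or ICU4J), the imports assume ICU4J is on the classpath and Scala 2.12 style converters, and it only uses the ICU calls already shown in the REPL sessions.
{code:scala}
import com.ibm.icu.text.TimeZoneNames
import com.ibm.icu.util.{TimeZone, ULocale}
// Scala 2.12 converters (assumption); on 2.13+ use scala.jdk.CollectionConverters._
import scala.collection.JavaConverters._

// Hypothetical helper: returns the zones with the given raw offset in the
// given region that are their own (and only) reference/golden zone.
def goldenZonesFor(rawOffsetMillis: Int, locale: ULocale, region: String): Set[String] = {
  val ids = TimeZone
    .getAvailableIDs(TimeZone.SystemTimeZoneType.CANONICAL, region, Int.box(rawOffsetMillis))
    .asScala
    .toSet
  val tzn = TimeZoneNames.getInstance(locale)
  ids.filter { id =>
    // Map every meta zone this ID has belonged to onto its reference zone.
    val golden = tzn
      .getAvailableMetaZoneIDs(id)
      .asScala
      .map(meta => tzn.getReferenceZoneID(meta, region))
      .toSet
    // Same condition as the filter above: exactly one golden zone, equal to the ID itself.
    golden == Set(id)
  }
}
{code}
With the same inputs as the session above, goldenZonesFor(-18000000, new ULocale("en_US"), "US") would be expected to give the same single America/New_York result, though that depends on the time zone data in the ICU version in use.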

Unfortunately, things get more complicated if the locale does not have a region. For example, if we use a locale of just "en" and a reference zone region of "001" (the world region), and the alternate getAvailableIDs method since we don't have a country code, the results look like this:
{code:scala}
scala> val ids = TimeZone.getAvailableIDs(-18000000).toSeq
scala> val tzn = TimeZoneNames.getInstance(new ULocale("en"))
scala> val metaMap = ids.map { id => id -> tzn.getAvailableMetaZoneIDs(id).asScala }.toMap
scala> val goldenMap = metaMap.mapValues { metas => metas.map { meta => tzn.getReferenceZoneID(meta, "001") }}
scala> val timeZone = goldenMap.filter { case (key, values) => values.size == 1 && key == values.head }.keys
timeZone: Iterable[String] = Set(
America/New_York,
America/Guayaquil,
America/Lima,
America/Bogota,
America/Havana)
{code}
So with no region in the locale (e.g. dfdl:calendarLanguage="en", which is the default in our DFDLGeneralFormat), a GMT-5 timezone with this logic could be any one of a number of golden timezones, with no way to pick a "best" one. Maybe when there is no country, or when the timezone resolves to multiple "golden" zones, we just bail and use the GMT offset.
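
A minimal sketch of that "bail out" rule, in plain Scala with no ICU dependency. Both function names are hypothetical, and the "GMT-05:00" style ID is an assumption modeled on the offset form mentioned earlier in this comment, not something Daffodil is known to produce this way.
{code:scala}
// Hypothetical helper: formats a raw offset in milliseconds as "GMT+hh:mm" or "GMT-hh:mm".
def gmtOffsetId(rawOffsetMillis: Int): String = {
  val sign = if (rawOffsetMillis < 0) "-" else "+"
  val totalMinutes = math.abs(rawOffsetMillis) / 60000
  f"GMT$sign${totalMinutes / 60}%02d:${totalMinutes % 60}%02d"
}

// Hypothetical helper: use the golden zone only when it is unambiguous,
// otherwise fall back to the plain GMT offset.
def chooseZoneId(goldenZones: Set[String], rawOffsetMillis: Int): String =
  if (goldenZones.size == 1) goldenZones.head else gmtOffsetId(rawOffsetMillis)
{code}
For example, chooseZoneId(Set("America/New_York"), -18000000) keeps the named zone, while passing the five golden zones listed above falls back to the offset form.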

Note that this is a lot of work just to figure out the "best" timezone for an offset. This could probably be made more efficient or even pre-calculated; the golden timezone isn't really going to change for an offset unless the locale changes a lot. But even then it isn't guaranteed to be what the user expected.
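
One way the lookup could be cached, as a rough sketch. GoldenZoneCache and its compute parameter are hypothetical; compute stands in for whatever offset-to-golden-zone lookup ends up being used, and keying on (offset, region) is an assumption.
{code:scala}
import scala.collection.concurrent.TrieMap

// Hypothetical cache wrapper around an expensive (offset, region) => zones lookup.
final class GoldenZoneCache(compute: (Int, String) => Set[String]) {
  private val cache = TrieMap.empty[(Int, String), Set[String]]

  // Returns the cached result for this key, computing and storing it on first use.
  // Under contention compute may run more than once for the same key, which is
  // harmless as long as it is a pure function of its arguments.
  def lookup(rawOffsetMillis: Int, region: String): Set[String] =
    cache.getOrElseUpdate((rawOffsetMillis, region), compute(rawOffsetMillis, region))
}
{code}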

Maybe we punt and say that canonicalizing time zones to a GMT offset is what we do, and if the user wants to be guaranteed to keep the localized timezone, then it needs to be modeled as a separate string element, with restrictions to validate that the values are expected timezones, and not parsed as part of the calendar pattern?

Note that IBM DFDL has the same behavior as Daffodil. Data with an "EST" timezone parses to GMT-05:00 in the infoset, and that unparses to "GMT-5", even with dfdl:calendarLanguage="en_US" and "vvv" in the pattern.

Due to this and the complexity of getting a "right" answer discussed above, I'm leaning towards resolving this as expected behavior.

> time zone "z" specifier does not work properly
> ----------------------------------------------
>
>                 Key: DAFFODIL-2823
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2823
>             Project: Daffodil
>          Issue Type: Bug
>          Components: Back End
>    Affects Versions: 3.4.0
>            Reporter: Mike Beckerle
>            Priority: Minor
>
> The time zone string "EST" is parsed by calendarPattern character "z", and becomes ISO standard "-05:00" on parse, but "GMT-5" on unparse.
> This happens despite dfdl:calendarLanguage="en_US". That is, the problem is that the unparse should produce "EST", but does not.
> The locale is needed to implement "z" time zone format because some of these 3-letter timezone specifiers are ambiguous, so the locale must be known to disambiguate them when parsing. 
> For example CST can be "Central Standard Time" (North America), "Cuba Standard Time" or "China Standard Time", and is widely used for "Central Standard Time" (Australia) according to [https://en.wikipedia.org/wiki/List_of_time_zone_abbreviations]
> So when parsing "z" one must use the locale. When unparsing it's not so ambiguous, but Daffodil doesn't appear to use the locale information.
> Per the ICU documentation here:  [https://icu.unicode.org/design/formatting/timezone/icu-4-8-time-zone-names]
> The "z" needs locale information, and several of the other specifiers also need it. 