Posted to commits@stdcxx.apache.org by Apache Wiki <wi...@apache.org> on 2008/03/11 19:38:15 UTC

[Stdcxx Wiki] Update of "LocaleLookup" by MartinSebor

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Stdcxx Wiki" for change notification.

The following page has been changed by MartinSebor:
http://wiki.apache.org/stdcxx/LocaleLookup

The comment on the change is:
Added more detail to The Plan. Minor edits in Problem Statement and Objective.

------------------------------------------------------------------------------
+ [[Anchor(Problem Statement)]]
  = Problem Statement =
  
  Modern operating systems provide support for dozens or even hundreds locales encoded in various codesets. The set of locales and codesets installed on a computer is typically determined by the system administrator at the time the operating system is installed. Although there are standards and conventions in place to establish a common set of locale names, due to historical reasons both locale and codeset names tend to vary from one implementation to another. Operating systems may provide the standard names as well as the traditional ones, with the former simply being aliases for the latter.
  
- The stdcxx test suite contains tests that exercise the behavior of the localization library. Since the set of installed locales may vary from server to server and since their names need not be consistent across different operating systems, the test stdcxx driver provides mechanisms to determine the names of all locales known to a system. For simplicity, many tests exercise the localization library using all these locale names. Other tests do so in an effort to exercise different code paths taken based on whether a locale uses a single-byte or multi-byte encoding. On systems with many installed locales running these tests may take a considerable amount of time and use up valuable system resources. For example, on AIX systems with all available locales installed running each test can take as much as an hour. In addition, since many of the locale names reference the same locale exercising all of them is wasteful. In addition, since many locales differ only in very minor details (e.g., the values of punctuator characters), exhaustively testing all of them ends up repeatedly executing the same code paths and is unnecessary.
+ The stdcxx test suite contains tests that exercise the behavior of the localization library. Since the set of installed locales may vary from server to server and since their names need not be consistent across different operating systems, the stdcxx test driver provides mechanisms to determine the names of all locales known to a system. For simplicity, many tests exercise the localization library using all these locale names. Other tests do so in an effort to exercise different code paths taken based on whether a locale uses a single-byte or multi-byte encoding. On systems with many installed locales running these tests may take a considerable amount of time and use up valuable system resources. For example, on AIX systems with all available locales installed running each test can take as much as an hour. In addition, since many of the locale names reference the same locale exercising all of them is wasteful. In addition, since many locales differ only in very minor details (e.g., the values of punctuator characters), exhaustively testing all of them ends up repeatedly executing the same code paths and is unnecessary.
  
+ [[Anchor(Objective)]]
  = Objective =
  
- The objective of this project is to provide an interface to make it easy to write localization tests without the knowledge of platform-specific details that provide sufficient code coverage and that complete in a reasonable amount of time (ideally seconds as opposed to minutes). The interface must make it easy to query the system for locales that satisfy the specific requirements of each test. For example, most tests that currently use all installed locales (e.g., the set of tests for the `std::ctype` facet) only need to exercise a representative sample of the installed locales without using the same locale more than once. Thus the interface will need to make it possible to specify such a sample. Another example is tests that attempt to exercise locales in multibyte encodings whose `MB_CUR_MAX` ranges from 1 to 6 (some of the `std::codecvt` facet tests). The new interface will need to make it easy to specify such a set of locales without explicitly naming them, and it will need to retrieve such locales without returning duplicates.
+ The objective of this project is to provide an interface to make it easy to write localization tests without the knowledge of platform-specific details (such as locale names) that provide sufficient code coverage and that complete in a reasonable amount of time (ideally seconds as opposed to minutes). The interface must make it easy to query the system for locales that satisfy the specific requirements of each test. For example, most tests that currently use all installed locales (e.g., the set of tests for the `std::ctype` facet) only need to exercise a representative sample of the installed locales without using the same locale more than once. Thus the interface will need to make it possible to specify such a sample. Another example is tests that attempt to exercise locales in multibyte encodings whose `MB_CUR_MAX` ranges from 1 to 6 (some of the `std::codecvt` facet tests). The new interface will need to make it easy to specify such a set of locales without explicitly naming them, and it will need to retrieve such locales without returning duplicates.
  
  [[Anchor(Definitions)]]
  = Definitions =
@@ -28, +30 @@

  [[Anchor(Plan)]]
  = Plan =
  
- This page relates to the issue described at http://issues.apache.org/jira/browse/STDCXX-608. There has been some discussion both on and off the dev@ list about how to proceed. This page is here to document what has been discussed.
+ This page relates to the issue described in [http://issues.apache.org/jira/browse/STDCXX-608 STDCXX-608]. There has been some discussion both on and off the dev@ list about how to proceed. This page is here to document what has been discussed.
  
- The plan is to take a regular expression like query string, do a brace expansion to get several simpler regular expressions, and then search the list of installed locales for matches.
+ The plan to meet the [#Objective Objective] is to provide an interface to query the set of installed locales based on a small set of essential parameters used by the localization tests. The interface should make it easy to express conjunction, disjunction, and negation of the terms (parameters) and support (a perhaps simplified version of) [http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03 Basic Regular Expression] syntax. We've decided to use shell brace expansion as a means of expressing logical conjunction between terms: a valid brace expression is expanded to obtain a set of terms implicitly connected by a logical AND. Individual ('\n'-separated) lines of the query string are taken to be implicitly connected by a logical OR. This approach models the [http://www.opengroup.org/onlinepubs/009695399/utilities/grep.html grep] interface with each line loosely corresponding to the argument of the `-e` option to `grep`.
  
  Given a query string 
  
@@ -82, +84 @@

     a 1 2 b
  }}}
  
- In most cases you would want to use rw_shell_expand(). '''Perhaps ''rw_brace_expand'' should become an implementation function and the header/source/test should be renamed to shellexp.h/shellexp.cpp/0.shellexp.cpp''' 
+ In most cases you would want to use `rw_shell_expand()`. '''Perhaps ''rw_brace_expand'' should become an implementation function and the header/source/test should be renamed to shellexp.h/shellexp.cpp/0.shellexp.cpp''' 
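To make the brace-expansion step concrete, here is a minimal C++ sketch. It is not the test driver's `rw_brace_expand()`/`rw_shell_expand()` implementation: it handles only `{a,b,c}` alternation (nested braces included; the `{m..n}` sequence form is omitted), and the function name and signature are invented for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for rw_brace_expand(): expands {a,b,c}
// alternation into the full set of strings. Nested braces are
// handled by recursing on each partial expansion.
static std::vector<std::string> brace_expand (const std::string &s)
{
    // find the first '{' and its matching '}'
    const std::string::size_type open = s.find ('{');
    if (open == std::string::npos)
        return std::vector<std::string> (1, s);   // nothing to expand

    int depth = 0;
    std::string::size_type close = open;
    for (; close != s.size (); ++close) {
        if (s [close] == '{') ++depth;
        else if (s [close] == '}' && 0 == --depth) break;
    }
    if (close == s.size ())
        return std::vector<std::string> (1, s);   // unbalanced: literal

    const std::string head = s.substr (0, open);
    const std::string tail = s.substr (close + 1);
    const std::string body = s.substr (open + 1, close - open - 1);

    // split the brace body on top-level commas only
    std::vector<std::string> alts;
    std::string cur;
    depth = 0;
    for (std::string::size_type i = 0; i != body.size (); ++i) {
        const char c = body [i];
        if (c == '{') ++depth;
        else if (c == '}') --depth;
        if (c == ',' && 0 == depth) { alts.push_back (cur); cur.clear (); }
        else cur += c;
    }
    alts.push_back (cur);

    // splice each alternative back in and expand the rest recursively
    std::vector<std::string> result;
    for (std::vector<std::string>::size_type i = 0; i != alts.size (); ++i) {
        const std::vector<std::string> sub =
            brace_expand (head + alts [i] + tail);
        result.insert (result.end (), sub.begin (), sub.end ());
    }
    return result;
}
```

For example, `brace_expand ("zh-*-UTF8-{2,3,4}")` yields the three strings `zh-*-UTF8-2`, `zh-*-UTF8-3`, and `zh-*-UTF8-4`.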
  
  [[Anchor(Part2)]]
  = Part 2 (STDCXX-715) =

Re: [Stdcxx Wiki] Update of "LocaleLookup" by MartinSebor

Posted by Martin Sebor <se...@roguewave.com>.
Travis Vitek wrote:
> 
> 
> Martin Sebor wrote:
>> But we do need to come up with a sound specification of the query syntax
>> before implementing any more code.
>>
> 
> Okay, the proposed query syntax grammar is essentially the same as that
> being used for the <config> value in xfail.txt. So we have
> 
>   <match> is a shell globbing pattern in the format below. All fields
>   are required.
> 
>   iso-country  ::= ISO-3166 two character country code
>   iso-language ::= ISO-639-1 or ISO-639-2 two or three character language code
>   iana-codeset ::= IANA codeset name with '-' replaced or removed

Or escaped or quoted? E.g., UTF\-8 or "UTF-8"? If it's all the same
to you I would prefer to keep the IANA names unchanged. A good
number of them use the dash to separate two numeric parts of the
name from each other (e.g., ISO-8859-1 and ISO-8859-13) so dropping
the dash would make it difficult to tell one from the other, and
replacing the dash would mean finding a suitable character for the
replacement that's not used in any of the names and that's easy
enough to remember (I suppose the equals sign might qualify if
we had to go that route).

> 
>   match        ::=
> <iso-language-expr>-<iso-country-expr>-<iana-codeset-expr>-<mb_cur_len-expr>
>   match_list   ::= match | match ' ' match_list
> 
> So the previous example to select `en_US.*' with a 1 byte encoding or
> `zh_*.UTF-8' with a 2, 3, or 4 byte encoding would use the following query
> string.
> 
>   en-US-*-1 zh-*-UTF8-2 zh-*-UTF8-3 zh-*-UTF8-4

Okay, this makes it clear that space is an OR. The AND is implicit
in the dash, and there's no need for the '\n'.
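The semantics summarized here (space-separated matches OR-ed together, the fields within each match AND-ed implicitly by the dashes joining them) can be sketched as follows. Everything in this sketch is illustrative: `locale_query`, the key format, and the sample keys are hypothetical, and the inlined matcher is only a stand-in for the `rw_fnmatch()` subset under discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// minimal '*'/'?' glob matcher (stand-in for rw_fnmatch())
static bool glob_match (const std::string &pat, const std::string &str)
{
    std::size_t p = 0, s = 0;
    std::size_t star = std::string::npos, mark = 0;
    while (s != str.size ()) {
        if (p != pat.size () && (pat [p] == '?' || pat [p] == str [s])) {
            ++p; ++s;
        }
        else if (p != pat.size () && pat [p] == '*') {
            star = p++;     // remember the '*' for backtracking
            mark = s;
        }
        else if (star != std::string::npos) {
            p = star + 1;   // let the last '*' swallow one more character
            s = ++mark;
        }
        else
            return false;
    }
    while (p != pat.size () && pat [p] == '*')
        ++p;                // trailing stars match the empty string
    return p == pat.size ();
}

// Hypothetical query evaluation: each space-separated pattern is a
// complete <language>-<country>-<codeset>-<mb_cur_len> glob. A locale
// key is selected if ANY pattern matches it (space is an OR; the AND
// between fields is implicit in the dashes of the key itself).
static std::vector<std::string>
locale_query (const std::string &query,
              const std::vector<std::string> &keys)
{
    std::vector<std::string> pats;
    std::istringstream in (query);
    for (std::string pat; in >> pat; )
        pats.push_back (pat);

    std::vector<std::string> out;
    for (std::size_t k = 0; k != keys.size (); ++k)
        for (std::size_t p = 0; p != pats.size (); ++p)
            if (glob_match (pats [p], keys [k])) {
                out.push_back (keys [k]);
                break;      // each key reported once: no duplicates
            }
    return out;
}
```

Given the hypothetical keys `en-US-ISO-8859-1-1`, `zh-CN-UTF8-3`, and `de-DE-UTF8-2`, the query `"en-US-*-1 zh-*-UTF8-2 zh-*-UTF8-3 zh-*-UTF8-4"` would select the first two and reject the third.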

> 
> This long expression could be written using a brace expansion to simplify
> it.
> 
>   en-US-*-1 zh-*-UTF8-{2,3,4}
> 
> I propose that we not support the BRE syntax, simply because it is so
> complex.

Which part are you suggesting we not support? I ask because I don't
recall us talking about supporting the full BRE or anything beyond
the subset already implemented in rw_fnmatch().
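For reference, the globbing subset being discussed (just `*` and `?`, no character classes) can be sketched with a classic backtracking matcher; `glob_match` is an invented name for illustration, not the actual `rw_fnmatch()` interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the rw_fnmatch() subset: shell globbing
// with '*' (any sequence, including empty) and '?' (any one character).
static bool glob_match (const std::string &pat, const std::string &str)
{
    std::size_t p = 0, s = 0;
    std::size_t star = std::string::npos, mark = 0;
    while (s != str.size ()) {
        if (p != pat.size () && (pat [p] == '?' || pat [p] == str [s])) {
            ++p; ++s;                     // literal or '?' match
        }
        else if (p != pat.size () && pat [p] == '*') {
            star = p++;                   // remember the '*' position
            mark = s;                     // and how much it has consumed
        }
        else if (star != std::string::npos) {
            p = star + 1;                 // backtrack: '*' eats one more
            s = ++mark;
        }
        else
            return false;                 // mismatch with nothing to retry
    }
    while (p != pat.size () && pat [p] == '*')
        ++p;                              // trailing stars match ""
    return p == pat.size ();
}
```

With this subset, `glob_match ("zh_*.UTF-8", "zh_CN.UTF-8")` holds while `glob_match ("en_US.*", "de_DE.UTF-8")` does not.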

> Yes, it might be quite easy to prototype a solution using grep and
> other shell utilities, but providing a complete implementation in C [where
> we actually need it] is going to be difficult at best. For what we need,
> shell globbing should be sufficient to handle the cases that we need to
> satisfy the objective.
> 
> I suppose you could consider en-US-*-1 to be "language=en" and "country=US" and
> "codeset=*" and "mb_cur_len=1" so '-' represents an intersection operation,
> but I prefer to think of the entire expression to be either a match or not a
> match.

Sure. I personally don't see a difference between the two from
a practical POV.

> 
> 
> Martin Sebor wrote:
>> I think it's great
>> to put together a prototype at the same time, just as long as it's
>> understood that the prototype might need to change as we discover
>> flaws in it or better ways of doing it.
>>
> 
> I have no problem with flaws or small improvements. When we start talking
> about implementing a regular expression parser I get concerned.

I fully agree that implementing regular expressions just for this
project would be overkill. I don't think I ever suggested that we
implement BRE for this though. If I ever mentioned BRE (e.g., on
the wiki) I was referring to the subset used for fnmatch globbing.
If I somehow gave the impression that I was proposing we implement
it now I apologize for confusing things.

Martin

Re: [Stdcxx Wiki] Update of "LocaleLookup" by MartinSebor

Posted by Travis Vitek <vi...@roguewave.com>.


Martin Sebor wrote:
> 
> But we do need to come up with a sound specification of the query syntax
> before implementing any more code.
> 

Okay, the proposed query syntax grammar is essentially the same as that
being used for the <config> value in xfail.txt. So we have

  <match> is a shell globbing pattern in the format below. All fields
  are required.

  iso-country  ::= ISO-3166 two character country code
  iso-language ::= ISO-639-1 or ISO-639-2 two or three character language code
  iana-codeset ::= IANA codeset name with '-' replaced or removed

  match        ::=
<iso-language-expr>-<iso-country-expr>-<iana-codeset-expr>-<mb_cur_len-expr>
  match_list   ::= match | match ' ' match_list

So the previous example to select `en_US.*' with a 1 byte encoding or
`zh_*.UTF-8' with a 2, 3, or 4 byte encoding would use the following query
string.

  en-US-*-1 zh-*-UTF8-2 zh-*-UTF8-3 zh-*-UTF8-4

This long expression could be written using a brace expansion to simplify
it.

  en-US-*-1 zh-*-UTF8-{2,3,4}

I propose that we not support the BRE syntax, simply because it is so
complex. Yes, it might be quite easy to prototype a solution using grep and
other shell utilities, but providing a complete implementation in C [where
we actually need it] is going to be difficult at best. For what we need,
shell globbing should be sufficient to handle the cases that we need to
satisfy the objective.

I suppose you could consider en-US-*-1 to be "language=en" and "country=US" and
"codeset=*" and "mb_cur_len=1" so '-' represents an intersection operation,
but I prefer to think of the entire expression to be either a match or not a
match.


Martin Sebor wrote:
> 
> I think it's great
> to put together a prototype at the same time, just as long as it's
> understood that the prototype might need to change as we discover
> flaws in it or better ways of doing it.
> 

I have no problem with flaws or small improvements. When we start talking
about implementing a regular expression parser I get concerned.

Travis
-- 
Sent from the stdcxx-dev mailing list archive at Nabble.com.


Re: [Stdcxx Wiki] Update of "LocaleLookup" by MartinSebor

Posted by Martin Sebor <se...@roguewave.com>.
Travis, I don't think we've been wasting time. But we do need to come
up with a sound specification of the query syntax before implementing
any more code. Examples are helpful, but they are not a substitute for
a precise grammar and a description of the effects. I think it's great
to put together a prototype at the same time, just as long as it's
understood that the prototype might need to change as we discover
flaws in it or better ways of doing it.

Martin

Travis Vitek wrote:
>>
>>
>> Martin Sebor wrote:
>>  
>> Travis Vitek wrote:
>>>  
>>>
>>>> From: Apache Wiki [mailto:wikidiffs@apache.org] 
>>>>
>>>> The new 
>>>> interface will need to make it easy to specify such a set of 
>>>> locales without explicitly naming them, and it will need to
>>>> retrieve such locales without returning duplicates.
>>>>
>>> As mentioned before I don't know a good way to avoid duplicates other
>>> than to compare every attribute of each facet of each locale to all of
>>> the other locales. Just testing to see if the return from setlocale() is
>>> the same as the input string is not enough. The user could have installed
>>> locales that have unique names but are copies of the data from some
>>> other locale.
>> True, but we don't care about how long the test might run on
>> some user's system. What we care about here is that *we* don't
>> run tests unnecessarily on our own build servers, and we can
>> safely make the simplifying assumption that there are no user
>> defined locales installed on them.
>>
>>>> The interface should make it easy to 
>>>> express conjunction, disjunction, and negation of the terms 
>>>> (parameters) and support (a perhaps simplified version of) 
>>>> [http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_cha
>>>> p09.html#tag_09_03 Basic Regular Expression] syntax.
>>> Conjunction, disjunction and negation? Are you saying you want to be
>>> able to select all locales that are _not_ in some set, something like
>>> you would get with a caret (^) in a grep expression?
>> No, I meant something simple like grep -v.
>>
> 
> Okay, so this is an all-or-none type negation. I understand that, but I'm
> not sure if it is necessary given the objective.
> 
>>> I'm hoping that I'm just misunderstanding your comments. If not, then
>>> this is news to me and I'm a bit curious just how this addition is
>>> necessary to minimize the number of locales tested [i.e. the objective].
>> It may not be necessary. I included it for completeness, thinking
>> if it wasn't already there it could be easily added in the form
>> of an argument of the function. If it isn't there we can leave
>> it out until we need it.
>>
>>>> We've 
>>>> decided to use shell brace expansion as a means of expressing 
>>>> logical conjunction between terms: a valid brace expression is 
>>>> expanded to obtain a set of terms implicitly connected by a 
>>>> logical AND. Individual ('\n'-separated) lines of the query 
>>>> string are taken to be implicitly connected by a logical OR. 
>>>> This approach models the 
>>>> [http://www.opengroup.org/onlinepubs/009695399/utilities/grep.h
>>>> tml grep] interface with each line loosely corresponding to 
>>>> the argument of the `-e` option to `grep`.
>>>>
>>> I've seen you mention the '\n' separated list thing before, but I still
>>> can't make sense of it. Are you saying
>> In my mind the query expression consists of terms connected
>> by operators for conjunction, disjunction (let's forget about
>> negation for now). E.g., like so:
>>
>>    qry-expr ::= <dis-expr>
>>    dis-expr ::= <con-expr> | <dis-expr> <NL> <con-expr>
>>    con-expr ::= <term> | <term> <SP> <con-expr>
>>
>> For example:
>>
>>    "foo bar" is a con-expr of the terms "foo" and "bar" denoting
>>    the intersection of foo and bar, and
>>
>>    "123 xyz\nKLM" is a dis-expr of the terms "123 xyz" and "KLM"
>>    denoting the union of the two terms. "123 xyz" is itself
>>    a con-expr denoting the intersection of 123 and xyz.
>>
>>> that to select `en_US.*' with a 1
>>> byte encoding or `zh_*.UTF-8' with a 2, 3, or 4 byte encoding, I would
>>> write the following query?
>> I think it might be simpler to keep things abstract but given my
>> specification above a simple query string would look like this:
>>
>>    "en_US.*    1\n"
>>    "zh_*.UTF-8 2\n"
>>    "zh_*.UTF-8 3\n"
>>    "zh_*.UTF-8 4\n"
>>
>> for the equivalent of:
>>
>>       locale == "en_US.*"    && MB_CUR_MAX == 1
>>    || locale == "zh_*.UTF-8" && MB_CUR_MAX == 2
>>    || locale == "zh_*.UTF-8" && MB_CUR_MAX == 3
>>    || locale == "zh_*.UTF-8" && MB_CUR_MAX == 4
>>
> 
> I'm totally confused. If we're going to write out each of the expansions,
> then why did I take the time to implement brace expansion?
> 
> I thought the idea was to allow us to select locales using a brace expanded
> query string. If we are explicitly writing out each of the names, then we
> wasted a bunch of time writing brace expansion code.
> 
>> I'm not sure how you could use brace expressions here. Maybe it
>> should be the other way around (<SP> should be OR and <NL> AND).
>> But then the grep -e idea is out the window.
> 
> Well, if we're going down the road of rewriting this _again_ then how
> about using something like '&&' and '||', or even 'and' and 'or' for the
> logical operations and then '(' and ')' for grouping? Almost like the
> 'equivalent of' that you wrote above. Something that is readable by a
> C/C++ programmer or the average guy off of the street?
> 
> The truth is that not every guy knows grep, and I'm sure that those who
> do wouldn't expect to see a grammar that used '\n' and ' ' to represent
> logical operations.
> 
>> Or maybe we need
>> a way to denote/group terms. Then we might be able to say:
>>
>>    "en_US.*    1\n"
>>    "zh_*.UTF-8 ({2..4})"
>>
>> expand it to
>>
>>    "en_US.*    1\n"
>>    "zh_*.UTF-8 (2 3 4)"
>>
>> and "know" that the spaces in "2 3 4" denote OR even though the
>> space in "zh_*.UTF-8 (" denotes an AND. Ugh. I'm not sure I like
>> this all that much better.
>>
>>>   const char* locales =
>>>       rw_locale_query ("en_US.* 1\nzh_*.UTF-8 {2..4}", 10);
>>>
>>> I don't see why that would be necessary. You can do it with the
>>> following query using normal brace expansion, and it's human readable.
>>>
>>>   const char* locales =
>>>       rw_locale_query ("{en_US.* 1,zh_*.UTF-8 {2..4}}", 10);
>> What's "{en_US.* 1,zh_*.UTF-8 {2..4}}" supposed to expand to?
>> Bash 3.2 doesn't expand it. I suppose it could be
>>    "en_US.* 1 zh_*.UTF-8 2 3 4" or
>> "en_US.* 1 zh_*.UTF-8 2 zh_*.UTF-8 3 zh_*.UTF-8 4"
> 
> I believe that this is _exactly_ what you suggested in our meetings when
> I was in Boulder the last time. Maybe I'm just confused, but I am pretty
> sure that was what was presented.
> 
> The shell does whitespace collapse and tokenization before it does the
> expansion. To use whitespace in a brace expansion in the shell you have
> to escape it. 
> 
> So the following expansion should work just fine in csh...
> 
>   {en_US.*\ 1,zh_*.UTF-8\ {2..4}}
> 
> It should expand to
> 
>   en_US.* 1
>   zh_*.UTF-8 2
>   zh_*.UTF-8 3
>   zh_*.UTF-8 4
> 
> Remember that I originally provided rw_brace_expand() that doesn't do all
> of that. It treats whitespace like any other character. Of course if you
> insist on 100% compatibility with shell brace expansion, then feel free to
> escape the spaces. Personally I prefer strings without escapes.
> 
>> Either way I think I'm getting confused by the lack of distinction
>> between what's OR and what's AND.
>>
> 
> I give an example above of how a brace expansion already solves the
> problem.
> 
> If the brace expansion routine I've written returns a null terminated
> buffer of null terminated strings that are the brace expansions and we
> have a function for doing primitive string matching [rw_fnmatch], then
> this is a pretty simple problem to solve.
> 
> This is exactly what you are doing with the xfail.txt thing. The platform
> string is just a brace expansion and grep-like expression...
> 
>   aix-5.3-*-vacpp-9.0-{12,15}?
> 
> Why can't ours be separated by spaces, or some other character? Is it so
> different?
> 
> I suppose the big difference is that the format above is rigid and well
> defined, whereas the locale match format is still in flux.
> 
>>> I know that the '\n' is how you'd use `grep -e', but does it really make
>>> sense? We aren't using `grep -e' here.
>> I'm trying to model the interface on something we all know how
>> to use. grep -e seemed the closest example of a known interface
>> that would let us do what we want that I could think of.
>>
>> Maybe it would help to go back to the basics and try to approach
>> this by analyzing the problem first. Would putting this together
>> be helpful?
>>
> 
> That depends on how you define helpful. It will not be helpful in getting
> this task done in reasonable time. It may be helpful in convincing me to
> reimplement this functionality for a third time.
> 
>>    1. the set of locale attributes we want to keep track of
>>       in our locale "database"
>>
> 
> What details are necessary to reduce the number of locales tested? The
> honest answer to this is _none_. We could just pick N random locales and
> run the test with them. That would satisfy the original issue of testing
> too many locales.
> 
> That idea has been discarded, so the next best thing to do is to have it
> include a list of installed locales, and the language, territory and
> codeset canonical names as well as the MB_CUR_LEN value for each. Those
> are the only criteria that we currently use for selecting locales in the
> tests.
> 
> I don't see anything else useful. If there is some detail that is useful,
> most likely we could check it by loading the locale and getting the data
> directly instead of caching that data ourselves.
> 
>>    2. one or more possible formats of the database
>>
> 
> Because of all of the differences between similarly named locales on
> different systems, I don't think it makes sense to keep the locale
> data in revision control. It should probably be generated at runtime
> and flushed to a file for reuse by later tests.
> 
> Given that, I don't feel that the format of the data is significant. It
> might be nice for it to be human readable, but that is about it.
> 
>>    3. the kinds of queries done in our locale tests, and the
>>       ones we expect to do in future tests
> 
> This is the important question. As mentioned above, the only thing that
> I see being used is selecting locales by name and by MB_CUR_LEN.
> 
>> With that, we can create a prototype solution using an existing
>> query language of our choice (such as grep). Once that works,
>> the grammar should naturally fall out and we can reimplement
>> the prototype in the test driver.
>>
> 
> Isn't that what you did while I was in Boulder? That is how we arrived
> at this system of brace expansion and name matching that we are talking
> about now.
> 
> Your prototype boils down to something like this, where the fictional
> 'my_locale' utility lists the names of all installed locales followed
> by a separator and then the MB_CUR_LEN value.
> 
>   for i in `echo $brace_expr`;
>   do
>     my_locale -a | grep -e $i
>   done
> 
> Honestly, I don't care what the grammar is. I don't care what the format
> of the file is, and I don't care what shell utility we are trying to fake
> today.
> 
> All I care about is finishing up this task. Two months is more than enough
> time for something like this to be designed and implemented.
> 
>> Martin
>>
>>


Re: [Stdcxx Wiki] Update of "LocaleLookup" by MartinSebor

Posted by Martin Sebor <se...@roguewave.com>.
Travis Vitek wrote:
[...]
>> I think it might be simpler to keep things abstract but given my
>> specification above a simple query string would look like this:
>>
>>    "en_US.*    1\n"
>>    "zh_*.UTF-8 2\n"
>>    "zh_*.UTF-8 3\n"
>>    "zh_*.UTF-8 4\n"
>>
>> for the equivalent of:
>>
>>       locale == "en_US.*"    && MB_CUR_MAX == 1
>>    || locale == "zh_*.UTF-8" && MB_CUR_MAX == 2
>>    || locale == "zh_*.UTF-8" && MB_CUR_MAX == 3
>>    || locale == "zh_*.UTF-8" && MB_CUR_MAX == 4
>>
> 
> I'm totally confused. If we're going to write out each of the expansions,
> then why did I take the time to implement brace expansion?
> 
> I thought the idea was to allow us to select locales using a brace expanded
> query string. If we are explicitly writing out each of the names, then we
> wasted a bunch of time writing brace expansion code.

Yes, the idea was/is to use brace expansion to provide alternation.
I'm not dismissing the idea, just observing that it doesn't neatly
fit into the grammar I outlined above in the absence of anything
else. Can you show your grammar and explain how it works so that
I can better understand what we're discussing?

> 
>> I'm not sure how you could use brace expressions here. Maybe it
>> should be the other way around (<SP> should be OR and <NL> AND).
>> But then the grep -e idea is out the window.
> 
> Well, if we're going down the road of rewriting this _again_

Usually design precedes implementation. If you opt to implement first
you need to be prepared to make changes based on flaws in your (or
anyone else's) ideas that you use to ground your implementation in.

> then how
> about using something like '&&' and '||', or even 'and' and 'or' for the
> logical operations and then '(' and ')' for grouping? Almost like the
> 'equivalent of' that you wrote above. Something that is readable by a
> C/C++ programmer or the average guy off of the street?

We discussed and rejected this syntax.

> 
> The truth is that not every guy knows grep,

Anyone who works with this aspect of stdcxx will need to know it.

> and I'm sure that those who
> do wouldn't expect to see a grammar that used '\n' and ' ' to represent
> logical operations.

I agree. I will welcome a better alternative that's in line with
the "grep" spirit of the solution we agreed on (so that it can
be made use of in the Expected Failures project).

> 
>> Or maybe we need
>> a way to denote/group terms. Then we might be able to say:
>>
>>    "en_US.*    1\n"
>>    "zh_*.UTF-8 ({2..4})"
>>
>> expand it to
>>
>>    "en_US.*    1\n"
>>    "zh_*.UTF-8 (2 3 4)"
>>
>> and "know" that the spaces in "2 3 4" denote OR even though the
>> space in "zh_*.UTF-8 (" denotes an AND. Ugh. I'm not sure I like
>> this all that much better.
>>
>>>   const char* locales =
>>>       rw_locale_query ("en_US.* 1\nzh_*.UTF-8 {2..4}", 10);
>>>
>>> I don't see why that would be necessary. You can do it with the
>>> following query using normal brace expansion, and it's human readable.
>>>
>>>   const char* locales =
>>>       rw_locale_query ("{en_US.* 1,zh_*.UTF-8 {2..4}}", 10);
>> What's "{en_US.* 1,zh_*.UTF-8 {2..4}}" supposed to expand to?
>> Bash 3.2 doesn't expand it. I suppose it could be
>>    "en_US.* 1 zh_*.UTF-8 2 3 4" or
>> "en_US.* 1 zh_*.UTF-8 2 zh_*.UTF-8 3 zh_*.UTF-8 4"
> 
> I believe that this is _exactly_ what you suggested in our meetings when
> I was in Boulder the last time. Maybe I'm just confused, but I am pretty
> sure that was what was presented.

The whiteboard has been wiped clean so unless you wrote it down
we'll never know because I myself don't remember anymore. But if
you re-read the "low hanging fruit" thread I also suggested that
for example this string:

     "*_{JP,CN}.* {3,4}"

expand into

     "*_JP.* 4\n*_CN.* 4\n*_JP.* 3\n*_CN.* 3\n"

and probably a whole number of other things before then. The point
is that it was just a suggestion and not a fully baked design, and
like many other initial ideas it might have been flawed in more
than one way. Which is why we need to come up with a design spec
first, review it, and implement only if it makes sense to everyone
and if it deals with the use cases we're interested in dealing with.

> 
> The shell does whitespace collapse and tokenization before it does the
> expansion. To use whitespace in a brace expansion in the shell you have
> to escape it. 
> 
> So the following expansion should work just fine in csh...
> 
>   {en_US.*\ 1,zh_*.UTF-8\ {2..4}}
> 
> It should expand to
> 
>   en_US.* 1
>   zh_*.UTF-8 2
>   zh_*.UTF-8 3
>   zh_*.UTF-8 4

With newlines at the ends or without? I assume without, and that's
precisely why I'm not happy with it. Because the string above really
is
     "en_US.* 1 zh_*.UTF-8 2 zh_*.UTF-8 3 zh_*.UTF-8 4"

where each odd space means AND and every even space means OR. That
seems very unintuitive to me.

> 
> Remember that I originally provided rw_brace_expand() that doesn't do all
> of that. It treats whitespace like any other character. Of course if you
> insist on 100% compatibility with shell brace expansion, then feel free to
> escape the spaces. Personally I prefer strings without escapes.

So the explicit spaces aren't special but the ones that result from
brace expansion are? (This is an honest question.)

> 
>> Either way I think I'm getting confused by the lack of distinction
>> between what's OR and what's AND.
>>
> 
> I give an example above of how a brace expansion already solves the
> problem.
> 
> If the brace expansion routine I've written returns a null terminated
> buffer of null terminated strings that are the brace expansions and we
> have a function for doing primitive string matching [rw_fnmatch], then
> this is a pretty simple problem to solve.
> 
> This is exactly what you are doing with the xfail.txt thing. The platform
> string is just a brace expansion and grep-like expression...
> 
>   aix-5.3-*-vacpp-9.0-{12,15}?
> 
> Why can't ours be separated by spaces, or some other character? Is it so
> different?

You mean spaces instead of the dashes above? Because brace expansion
also produces spaces and because brace expansion happens before any
other processing. How would the "other processing" distinguish our
spaces from those produced from brace expansion?

Maybe we should ask the question the other way: why can't we use
dashes (or some other non-delimiting characters) in the locale spec
instead of spaces? Would that make me remove my objection? (I think
it might but I'd want to think about it some more).

> 
> I suppose the big difference is that the format above is rigid and well
> defined, whereas the locale match format is still in flux.
> 
>>> I know that the '\n' is how you'd use `grep -e', but does it really make
>>> sense? We aren't using `grep -e' here.
>> I'm trying to model the interface on something we all know how
>> to use. grep -e seemed the closest example of a known interface
>> that would let us do what we want that I could think of.
>>
>> Maybe it would help to go back to the basics and try to approach
>> this by analyzing the problem first. Would putting this together
>> be helpful?
>>
> 
> That depends on how you define helpful. It will not be helpful in getting
> this task done in reasonable time. 

That depends on how you define reasonable ;-) I had hoped we'd be
done in two to four weeks when we started.

> It may be helpful in convincing me to
> reimplement this functionality for a third time.
> 
>>    1. the set of locale attributes we want to keep track of
>>       in our locale "database"
>>
> 
> What details are necessary to reduce the number of locales tested? The
> honest answer to this is _none_. We could just pick N random locales and
> run the test with them. That would satisfy the original issue of testing
> too many locales.

But it would, in some cases, most likely compromise the effectiveness
of the tests, and if truly random, make failures hard to reproduce.

> 
> That idea has been discarded, so the next best thing to do is to have it
> include a list of installed locales, and the language, territory and
> codeset canonical names as well as the MB_CUR_MAX value for each. Those
> are the only criteria that we currently use for selecting locales in the
> tests.

Sounds reasonable. I see you added to the wiki. Great! What about
the OS each native name goes with? (e.g., utf8 on HP-UX vs UTF-8
on Linux). And locale aliases (e.g., en -> english -> en_US ->
en_US.ISO8859-1 on Linux -- I made up the names but the concept
is real)?
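Resolving such an alias chain is cheap if the database records it. A sketch using the made-up Linux names above (a real table would be generated per platform):

```python
# Made-up alias chain modeled on the en -> english -> en_US example;
# the names are invented, the concept is what matters.
aliases = {"en": "english", "english": "en_US", "en_US": "en_US.ISO8859-1"}

def canonical(name):
    # follow the chain until a name with no alias (cycle-safe)
    seen = set()
    while name in aliases and name not in seen:
        seen.add(name)
        name = aliases[name]
    return name

print(canonical("en"))  # -> en_US.ISO8859-1
```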

> 
> I don't see anything else useful. If there is some detail that is useful,
> most likely we could check it by loading the locale and getting the data
> directly instead of caching that data ourselves.
> 
>>    2. one or more possible formats of the database
>>
> 
> Because of all of the differences between similarly named locales on
> different systems, I don't think it makes sense to keep the locale
> data in revision control. It should probably be generated at runtime
> and flushed to a file for reuse by later tests.

Maybe. That to me seems like an implementation detail, but...

> 
> Given that, I don't feel that the format of the data is significant. It
> might be nice for it to be human readable, but that is about it.

...this seems relevant because it will dictate the structure of
the query. Unless you want to do a whole lot of massaging of the
query before you apply it to the file. And if you take the
prototype suggestion I made -- to implement it first using grep
or some such standard tool -- it will matter a whole lot. Also
if you want to get optimum performance out of your code and
minimize its complexity the format will matter. Structure it
too much and you'll end up writing a database API to parse it.
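For instance, a flat one-record-per-line format keeps the query code nearly trivial. A sketch with made-up locale records (field layout and values are hypothetical):

```python
import fnmatch

# Hypothetical database: one locale per line with name, language,
# territory, codeset, and MB_CUR_MAX. Flat text keeps queries simple.
db = """\
en_US.ISO8859-1 en US ISO8859-1 1
de_DE.UTF-8     de DE UTF-8     4
zh_CN.UTF-8     zh CN UTF-8     4
"""

def query(pattern, mb_cur_max):
    for record in db.splitlines():
        name, lang, terr, codeset, mb = record.split()
        if fnmatch.fnmatch(name, pattern) and int(mb) == mb_cur_max:
            yield name

print(list(query("*.UTF-8", 4)))
```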

> 
>>    3. the kinds of queries done in our locale tests, and the
>>       ones we expect to do in future tests
> 
> This is the important question. As mentioned above, the only thing that
> I see being used is selecting locales by name and by MB_CUR_MAX.

The codecvt or ctype tests do that. The numeric, monetary, and
time tests will probably be interested in the language (e.g.,
find any German locale so that I can use German day and month names to
test time_put). I'm saying we should try to come up with the
*actual* queries for these tests to see if what we're doing
here will be helpful there.

> 
>> With that, we can create a prototype solution using an existing
>> query language of our choice (such as grep). Once that works,
>> the grammar should naturally fall out and we can reimplement
>> the prototype in the test driver.
>>
> 
> Isn't that what you did while I was in Boulder? That is how we arrived
> at this system of brace expansion and name matching that we are talking
> about now.

By prototype I mean a solution that actually works. It could be
written in shell or Python and work on just a single platform,
and with limitations, but it proves the concept. What we did
when you visited in Boulder was sketch out a rough idea on the
whiteboard.

> 
> Your prototype boils down to something like this, where the fictional
> 'my_locale' utility lists the names of all installed locales followed
> by a separator and then the MB_CUR_MAX value.
> 
>   for i in `echo $brace_expr`;
>   do
>     my_locale -a | grep -e $i
>   done

This is a sketch, not a prototype. I can't take it and run it to
see how it works. There is no locale file so I have no idea what
$i should look like.

> 
> Honestly, I don't care what the grammar is. I don't care what the format
> of the file is, and I don't care what shell utility we are trying to fake
> today.
> 
> All I care about is finishing up this task. Two months is more than enough
> time for something like this to be designed and implemented.

Yet we still don't have the final grammar. It seems to me that we
need to start with it. I don't see how anyone could write any amount
of code unless they know what the grammar looks like.

Martin


RE: [Stdcxx Wiki] Update of "LocaleLookup" by MartinSebor

Posted by Travis Vitek <Tr...@roguewave.com>.
> 
> 
> 
> Martin Sebor wrote:
>  
> Travis Vitek wrote:
>>  
>> 
>>> From: Apache Wiki [mailto:wikidiffs@apache.org] 
>>>
>>> The new 
>>> interface will need to make it easy to specify such a set of 
>>> locales without explicitly naming them, and it will need to
>>> retrieve such locales without returning duplicates.
>>>
>> 
>> As mentioned before I don't know a good way to avoid duplicates other
>> than to compare every attribute of each facet of each locale to all of
>> the other locales. Just testing to see if the return from setlocale() is
>> the same as the input string is not enough. The user could have installed
>> locales that have unique names but are copies of the data from some
>> other locale.
> 
> True, but we don't care about how long the test might run on
> some user's system. What we care about here is that *we* don't
> run tests unnecessarily on our own build servers, and we can
> safely make the simplifying assumption that there are no user
> defined locales installed on them.
> 
>> 
>>> The interface should make it easy to 
>>> express conjunction, disjunction, and negation of the terms 
>>> (parameters) and support (a perhaps simplified version of) 
>>> [http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_cha
>>> p09.html#tag_09_03 Basic Regular Expression] syntax.
>> 
>> Conjunction, disjunction and negation? Are you saying you want to be
>> able to select all locales that are _not_ in some set, something like
>> you would get with a caret (^) in a grep expression?
> 
> No, I meant something simple like grep -v.
>

Okay, so this is an all-or-none type negation. I understand that, but I'm
not sure if it is necessary given the objective.

> 
>> 
>> I'm hoping that I'm just misunderstanding your comments. If not, then
>> this is news to me and I'm a bit curious just how this addition is
>> necessary to minimize the number of locales tested [i.e. the objective].
> 
> It may not be necessary. I included it for completeness, thinking
> if it wasn't already there it could be easily added in the form
> of an argument of the function. If it isn't there we can leave
> it out until we need it.
> 
>> 
>>> We've 
>>> decided to use shell brace expansion as a means of expressing 
>>> logical conjunction between terms: a valid brace expression is 
>>> expanded to obtain a set of terms implicitly connected by a 
>>> logical AND. Individual ('\n'-separated) lines of the query 
>>> string are taken to be implicitly connected by a logical OR. 
>>> This approach models the 
>>> [http://www.opengroup.org/onlinepubs/009695399/utilities/grep.h
>>> tml grep] interface with each line loosely corresponding to 
>>> the argument of the `-e` option to `grep`.
>>>
>> 
>> I've seen you mention the '\n' separated list thing before, but I still
>> can't make sense of it. Are you saying
> 
> In my mind the query expression consists of terms connected
> by operators for conjunction, disjunction (let's forget about
> negation for now). E.g., like so:
> 
>    qry-expr ::= <dis-expr>
>    dis-expr ::= <con-expr> | <dis-expr> <NL> <con-expr>
>    con-expr ::= <term> | <term> <SP> <con-expr>
> 
> For example:
> 
>    "foo bar" is a con-expr of the terms "foo" and "bar" denoting
>    the intersection of foo and bar, and
> 
>    "123 xyz\nKLM" is a dis-expr of the terms "123 xyz" and "KLM"
>    denoting the union of the two terms. "123 xyz" is itself
>    a con-expr denoting the intersection of 123 and xyz.
> 
>> that to select `en_US.*' with a 1
>> byte encoding or `zh_*.UTF-8' with a 2, 3, or 4 byte encoding, I would
>> write the following query?
> 
> I think it might be simpler to keep things abstract but given my
> specification above a simple query string would look like this:
> 
>    "en_US.*    1\n"
>    "zh_*.UTF-8 2\n"
>    "zh_*.UTF-8 3\n"
>    "zh_*.UTF-8 4\n"
> 
> for the equivalent of:
> 
>       locale == "en_US.*"    && MB_CUR_MAX == 1
>    || locale == "zh_*.UTF-8" && MB_CUR_MAX == 2
>    || locale == "zh_*.UTF-8" && MB_CUR_MAX == 3
>    || locale == "zh_*.UTF-8" && MB_CUR_MAX == 4
> 

I'm totally confused. If we're going to write out each of the expansions,
then why did I take the time to implement brace expansion?

I thought the idea was to allow us to select locales using a brace expanded
query string. If we are explicitly writing out each of the names, then we
wasted a bunch of time writing brace expansion code.

> I'm not sure how you could use brace expressions here. Maybe it
> should be the other way around (<SP> should be OR and <NL> AND).
> But then the grep -e idea is out the window.

Well, if we're going down the road of rewriting this _again_ then how
about using something like '&&' and '||', or even 'and' and 'or' for the
logical operations and then '(' and ')' for grouping? Almost like the
'equivalent of' that you wrote above. Something that is readable by a
C/C++ programmer or the average guy off of the street?

The truth is that not every guy knows grep, and I'm sure that those who
do wouldn't expect to see a grammar that used '\n' and ' ' to represent
logical operations.
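Something along those lines is easy to prototype. A sketch that maps the proposed '&&'/'||' syntax onto Python's own operators (the m() helper and every name here are invented for illustration, not part of any existing driver):

```python
import fnmatch

def matches(query, name, mb_cur_max):
    # Map the proposed C-style operators onto Python's; m() tests the
    # locale name against a pattern, mb is MB_CUR_MAX. Precedence of
    # 'and' over 'or' matches && over ||, and '(' ')' group as usual.
    env = {"m": lambda pat: fnmatch.fnmatch(name, pat), "mb": mb_cur_max}
    return eval(query.replace("&&", " and ").replace("||", " or "), env)

print(matches('m("zh_*.UTF-8") && mb == 4', "zh_CN.UTF-8", 4))
```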

> Or maybe we need
> a way to denote/group terms. Then we might be able to say:
> 
>    "en_US.*    1\n"
>    "zh_*.UTF-8 ({2..4})"
> 
> expand it to
> 
>    "en_US.*    1\n"
>    "zh_*.UTF-8 (2 3 4)"
> 
> and "know" that the spaces in "2 3 4" denote OR even though the
> space in "zh_*.UTF-8 (" denotes an AND. Ugh. I'm not sure I like
> this all that much better.
> 
>>   const char* locales =
>>       rw_locale_query ("en_US.* 1\nzh_*.UTF-8 {2..4}", 10);
>> 
>> I don't see why that would be necessary. You can do it with the
>> following query using normal brace expansion, and it's human readable.
>> 
>>   const char* locales =
>>       rw_locale_query ("{en_US.* 1,zh_*.UTF-8 {2..4}}", 10);
> 
> What's "{en_US.* 1,zh_*.UTF-8 {2..4}}" supposed to expand to?
> Bash 3.2 doesn't expand it. I suppose it could be
>    "en_US.* 1 zh_*.UTF-8 2 3 4" or
> "en_US.* 1 zh_*.UTF-8 2 zh_*.UTF-8 3 zh_*.UTF-8 4"

I believe that this is _exactly_ what you suggested in our meetings when
I was in Boulder the last time. Maybe I'm just confused, but I am pretty
sure that was what was presented.

The shell does whitespace collapse and tokenization before it does the
expansion. To use whitespace in a brace expansion in the shell you have
to escape it. 

So the following expansion should work just fine in csh...

  {en_US.*\ 1,zh_*.UTF-8\ {2..4}}

It should expand to

  en_US.* 1
  zh_*.UTF-8 2
  zh_*.UTF-8 3
  zh_*.UTF-8 4

Remember that I originally provided rw_brace_expand() that doesn't do all
of that. It treats whitespace like any other character. Of course if you
insist on 100% compatibility with shell brace expansion, then feel free to
escape the spaces. Personally I prefer strings without escapes.
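A minimal Python sketch of such an expansion, mirroring rw_brace_expand()'s treatment of whitespace as an ordinary character (handles only {a,b} alternation and {m..n} integer ranges; not a full shell-compatible implementation):

```python
import re

def expand(s):
    # Minimal brace expansion sketch: {a,b} alternation and {m..n}
    # integer ranges only; whitespace is not special.
    i = s.find("{")
    if i < 0:
        return [s]
    depth = 0
    for j, c in enumerate(s[i:], i):       # find the matching '}'
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                break
    body = s[i + 1:j]
    parts, depth, cur = [], 0, ""          # split body on top-level commas
    for c in body:
        depth += c == "{"
        depth -= c == "}"
        if c == "," and depth == 0:
            parts.append(cur)
            cur = ""
        else:
            cur += c
    parts.append(cur)
    rng = re.fullmatch(r"(-?\d+)\.\.(-?\d+)", body)
    if rng and len(parts) == 1:            # {m..n} numeric range
        lo, hi = int(rng.group(1)), int(rng.group(2))
        parts = [str(k) for k in range(lo, hi + 1)]
    out = []                               # recurse on prefix/suffix
    for p in parts:
        for tail in expand(s[j + 1:]):
            out.extend(expand(s[:i] + p + tail))
    return out

print(expand("{en_US.* 1,zh_*.UTF-8 {2..4}}"))
```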

> 
> Either way I think I'm getting confused by the lack of distinction
> between what's OR and what's AND.
> 

I give an example above of how a brace expansion already solves the
problem.

If the brace expansion routine I've written returns a null terminated
buffer of null terminated strings that are the brace expansions and we
have a function for doing primitive string matching [rw_fnmatch], then
this is a pretty simple problem to solve.
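With hypothetical stand-ins for rw_brace_expand()'s output buffer and the installed-locale list, the matching step might look like (fnmatch here plays the role of rw_fnmatch):

```python
import fnmatch

# Hypothetical stand-ins: 'expansions' is what rw_brace_expand() would
# produce; 'installed' is what enumerating the system's locales would give.
expansions = ["en_US.*", "zh_*.UTF-8"]
installed = ["en_US.ISO8859-1", "ja_JP.eucJP", "zh_CN.UTF-8"]

# keep each installed locale that matches any expanded pattern
selected = [loc for loc in installed
            if any(fnmatch.fnmatch(loc, pat) for pat in expansions)]
print(selected)
```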

This is exactly what you are doing with the xfail.txt thing. The platform
string is just a brace expansion and grep-like expression...

  aix-5.3-*-vacpp-9.0-{12,15}?

Why can't ours be separated by spaces, or some other character? Is it so
different?

I suppose the big difference is that the format above is rigid and well
defined, whereas the locale match format is still in flux.

>> 
>> I know that the '\n' is how you'd use `grep -e', but does it really make
>> sense? We aren't using `grep -e' here.
> 
> I'm trying to model the interface on something we all know how
> to use. grep -e seemed the closest example of an known interface
> that would let us do what we want that I could think of.
> 
> Maybe it would help to go back to the basics and try to approach
> this by analyzing the problem first. Would putting this together
> be helpful?
>

That depends on how you define helpful. It will not be helpful in getting
this task done in reasonable time. It may be helpful in convincing me to
reimplement this functionality for a third time.

> 
>    1. the set of locale attributes we want to keep track of
>       in our locale "database"
> 

What details are necessary to reduce the number of locales tested? The
honest answer to this is _none_. We could just pick N random locales and
run the test with them. That would satisfy the original issue of testing
too many locales.

That idea has been discarded, so the next best thing to do is to have it
include a list of installed locales, and the language, territory and
codeset canonical names as well as the MB_CUR_MAX value for each. Those
are the only criteria that we currently use for selecting locales in the
tests.

I don't see anything else useful. If there is some detail that is useful,
most likely we could check it by loading the locale and getting the data
directly instead of caching that data ourselves.
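That is essentially what the C library already exposes through setlocale() and localeconv(). A sketch of the same direct query in Python ("C" is the one locale name guaranteed to exist everywhere; real tests would pass a platform locale name):

```python
import locale

# Query the locale data directly instead of caching it ourselves.
locale.setlocale(locale.LC_ALL, "C")
conv = locale.localeconv()
print(conv["decimal_point"])  # "." in the C locale
```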

>    2. one or more possible formats of the database
> 

Because of all of the differences between similarly named locales on
different systems, I don't think it makes sense to keep the locale
data in revision control. It should probably be generated at runtime
and flushed to a file for reuse by later tests.

Given that, I don't feel that the format of the data is significant. It
might be nice for it to be human readable, but that is about it.

>    3. the kinds of queries done in our locale tests, and the
>       ones we expect to do in future tests

This is the important question. As mentioned above, the only thing that
I see being used is selecting locales by name and by MB_CUR_MAX.

> With that, we can create a prototype solution using an existing
> query language of our choice (such as grep). Once that works,
> the grammar should naturally fall out and we can reimplement
> the prototype in the test driver.
>

Isn't that what you did while I was in Boulder? That is how we arrived
at this system of brace expansion and name matching that we are talking
about now.

Your prototype boils down to something like this, where the fictional
'my_locale' utility lists the names of all installed locales followed
by a separator and then the MB_CUR_MAX value.

  for i in `echo $brace_expr`;
  do
    my_locale -a | grep -e $i
  done

Honestly, I don't care what the grammar is. I don't care what the format
of the file is, and I don't care what shell utility we are trying to fake
today.

All I care about is finishing up this task. Two months is more than enough
time for something like this to be designed and implemented.

> Martin
> 
> 

Re: [Stdcxx Wiki] Update of "LocaleLookup" by MartinSebor

Posted by Martin Sebor <se...@roguewave.com>.
Travis Vitek wrote:
>  
> 
>> From: Apache Wiki [mailto:wikidiffs@apache.org] 
>>
>> The new 
>> interface will need to make it easy to specify such a set of 
>> locales without explicitly naming them, and it will need to
>> retrieve such locales without returning duplicates.
>>
> 
> As mentioned before I don't know a good way to avoid duplicates other
> than to compare every attribute of each facet of each locale to all of
> the other locales. Just testing to see if the return from setlocale() is
> the same as the input string is not enough. The user could have installed
> locales that have unique names but are copies of the data from some
> other locale.

True, but we don't care about how long the test might run on
some user's system. What we care about here is that *we* don't
run tests unnecessarily on our own build servers, and we can
safely make the simplifying assumption that there are no user
defined locales installed on them.

> 
>> The interface should make it easy to 
>> express conjunction, disjunction, and negation of the terms 
>> (parameters) and support (a perhaps simplified version of) 
>> [http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_cha
>> p09.html#tag_09_03 Basic Regular Expression] syntax.
> 
> Conjunction, disjunction and negation? Are you saying you want to be
> able to select all locales that are _not_ in some set, something like
> you would get with a caret (^) in a grep expression?

No, I meant something simple like grep -v.

> 
> I'm hoping that I'm just misunderstanding your comments. If not, then
> this is news to me and I'm a bit curious just how this addition is
> necessary to minimize the number of locales tested [i.e. the objective].

It may not be necessary. I included it for completeness, thinking
if it wasn't already there it could be easily added in the form
of an argument of the function. If it isn't there we can leave
it out until we need it.

> 
>> We've 
>> decided to use shell brace expansion as a means of expressing 
>> logical conjunction between terms: a valid brace expression is 
>> expanded to obtain a set of terms implicitly connected by a 
>> logical AND. Individual ('\n'-separated) lines of the query 
>> string are taken to be implicitly connected by a logical OR. 
>> This approach models the 
>> [http://www.opengroup.org/onlinepubs/009695399/utilities/grep.h
>> tml grep] interface with each line loosely corresponding to 
>> the argument of the `-e` option to `grep`.
>>
> 
> I've seen you mention the '\n' separated list thing before, but I still
> can't make sense of it. Are you saying

In my mind the query expression consists of terms connected
by operators for conjunction, disjunction (let's forget about
negation for now). E.g., like so:

   qry-expr ::= <dis-expr>
   dis-expr ::= <con-expr> | <dis-expr> <NL> <con-expr>
   con-expr ::= <term> | <term> <SP> <con-expr>

For example:

   "foo bar" is a con-expr of the terms "foo" and "bar" denoting
   the intersection of foo and bar, and

   "123 xyz\nKLM" is a dis-expr of the terms "123 xyz" and "KLM"
   denoting the union of the two terms. "123 xyz" is itself
   a con-expr denoting the intersection of 123 and xyz.
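A sketch of an evaluator for this grammar, with fnmatch standing in for the name matching and made-up locale data (a term is taken to be either a name pattern or an MB_CUR_MAX value):

```python
import fnmatch

# Made-up locale data for illustration: (name, MB_CUR_MAX) pairs.
locales = [("en_US.ISO8859-1", 1), ("zh_CN.UTF-8", 4)]

def term_matches(term, name, mb):
    # a digit term tests MB_CUR_MAX; anything else is a name pattern
    return term == str(mb) if term.isdigit() else fnmatch.fnmatch(name, term)

def query(qry):
    for name, mb in locales:
        for line in qry.splitlines():            # <NL> separates: OR
            if all(term_matches(t, name, mb)     # <SP> separates: AND
                   for t in line.split()):
                yield name
                break

print(list(query("en_US.* 1\nzh_*.UTF-8 4")))
```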

> that to select `en_US.*' with a 1
> byte encoding or `zh_*.UTF-8' with a 2, 3, or 4 byte encoding, I would
> write the following query?

I think it might be simpler to keep things abstract but given my
specification above a simple query string would look like this:

   "en_US.*    1\n"
   "zh_*.UTF-8 2\n"
   "zh_*.UTF-8 3\n"
   "zh_*.UTF-8 4\n"

for the equivalent of:

      locale == "en_US.*"    && MB_CUR_MAX == 1
   || locale == "zh_*.UTF-8" && MB_CUR_MAX == 2
   || locale == "zh_*.UTF-8" && MB_CUR_MAX == 3
   || locale == "zh_*.UTF-8" && MB_CUR_MAX == 4

I'm not sure how you could use brace expressions here. Maybe it
should be the other way around (<SP> should be OR and <NL> AND).
But then the grep -e idea is out the window. Or maybe we need
a way to denote/group terms. Then we might be able to say:

   "en_US.*    1\n"
   "zh_*.UTF-8 ({2..4})"

expand it to

   "en_US.*    1\n"
   "zh_*.UTF-8 (2 3 4)"

and "know" that the spaces in "2 3 4" denote OR even though the
space in "zh_*.UTF-8 (" denotes an AND. Ugh. I'm not sure I like
this all that much better.

> 
>   const char* locales = rw_locale_query ("en_US.* 1\nzh_*.UTF-8 {2..4}",
> 10);
> 
> I don't see why that would be necessary. You can do it with the
> following query using normal brace expansion, and it's human readable.
> 
>   const char* locales = rw_locale_query ("{en_US.* 1,zh_*.UTF-8
> {2..4}}", 10);

What's "{en_US.* 1,zh_*.UTF-8 {2..4}}" supposed to expand to?
Bash 3.2 doesn't expand it. I suppose it could be
   "en_US.* 1 zh_*.UTF-8 2 3 4" or
"en_US.* 1 zh_*.UTF-8 2 zh_*.UTF-8 3 zh_*.UTF-8 4"

Either way I think I'm getting confused by the lack of distinction
between what's OR and what's AND.

> 
> I know that the '\n' is how you'd use `grep -e', but does it really make
> sense? We aren't using `grep -e' here.

I'm trying to model the interface on something we all know how
to use. grep -e seemed the closest example of a known interface
that would let us do what we want that I could think of.

Maybe it would help to go back to the basics and try to approach
this by analyzing the problem first. Would putting this together
be helpful?

   1. the set of locale attributes we want to keep track of
      in our locale "database"

   2. one or more possible formats of the database

   3. the kinds of queries done in our locale tests, and the
      ones we expect to do in future tests

With that, we can create a prototype solution using an existing
query language of our choice (such as grep). Once that works,
the grammar should naturally fall out and we can reimplement
the prototype in the test driver.

Martin

RE: [Stdcxx Wiki] Update of "LocaleLookup" by MartinSebor

Posted by Travis Vitek <Tr...@roguewave.com>.
 

>From: Apache Wiki [mailto:wikidiffs@apache.org] 
>
>The new 
>interface will need to make it easy to specify such a set of 
>locales without explicitly naming them, and it will need to
>retrieve such locales without returning duplicates.
>

As mentioned before I don't know a good way to avoid duplicates other
than to compare every attribute of each facet of each locale to all of
the other locales. Just testing to see if the return from setlocale() is
the same as the input string is not enough. The user could have installed
locales that have unique names but are copies of the data from some
other locale.

>The interface should make it easy to 
>express conjunction, disjunction, and negation of the terms 
>(parameters) and support (a perhaps simplified version of) 
>[http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_cha
>p09.html#tag_09_03 Basic Regular Expression] syntax.

Conjunction, disjunction and negation? Are you saying you want to be
able to select all locales that are _not_ in some set, something like
you would get with a caret (^) in a grep expression?

I'm hoping that I'm just misunderstanding your comments. If not, then
this is news to me and I'm a bit curious just how this addition is
necessary to minimize the number of locales tested [i.e. the objective].

>We've 
>decided to use shell brace expansion as a means of expressing 
>logical conjunction between terms: a valid brace expression is 
>expanded to obtain a set of terms implicitly connected by a 
>logical AND. Individual ('\n'-separated) lines of the query 
>string are taken to be implicitly connected by a logical OR. 
>This approach models the 
>[http://www.opengroup.org/onlinepubs/009695399/utilities/grep.h
>tml grep] interface with each line loosely corresponding to 
>the argument of the `-e` option to `grep`.
>

I've seen you mention the '\n' separated list thing before, but I still
can't make sense of it. Are you saying that to select `en_US.*' with a 1
byte encoding or `zh_*.UTF-8' with a 2, 3, or 4 byte encoding, I would
write the following query?

  const char* locales = rw_locale_query ("en_US.* 1\nzh_*.UTF-8 {2..4}",
10);

I don't see why that would be necessary. You can do it with the
following query using normal brace expansion, and it's human readable.

  const char* locales = rw_locale_query ("{en_US.* 1,zh_*.UTF-8
{2..4}}", 10);

I know that the '\n' is how you'd use `grep -e', but does it really make
sense? We aren't using `grep -e' here.

Travis