Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2020/05/20 10:24:54 UTC

[GitHub] [couchdb] willholley opened a new issue #2895: In Cloudant Query regex with the caseless (?i) modifier is not case-insensitive for unicode strings

willholley opened a new issue #2895:
URL: https://github.com/apache/couchdb/issues/2895


   
   ## Description
   
   In Cloudant Query, when you use the `$regex` operator with the `(?i)` modifier, which is supposed to make the match case-insensitive, the query is not in fact case-insensitive if the regex contains a character outside the ASCII repertoire.
   
   This appears to be due to the way that unicode strings are represented internally and how that representation interacts with [re:run/3](https://erlang.org/doc/man/re.html#run-3).
   
   ## Steps to Reproduce
   
   ```
   $ curl -u admin:password -X PUT http://localhost:5984/test
   {"ok":true}
   
   $ curl -u admin:password -X POST http://localhost:5984/test -H "Content-Type: application/json" -d '
   {
     "_id": "1",
     "data": "xxxöxxx"
   }'
   {"ok":true,"id":"1","rev":"1-96be014c47e090c7705b66c4b646d6f6"}
   
   $ curl -u admin:password -X POST http://localhost:5984/test -H "Content-Type: application/json" -d '
   {
     "_id": "2",
     "data": "xxxÖxxx"
   }'
   {"ok":true,"id":"2","rev":"1-903d0f61be4c2eda02c1bd72d3ba92bc"}
   
   $ curl -u admin:password -X POST http://localhost:5984/test/_find -H "Content-Type: application/json" -d '
   {
     "selector": {
       "data": {
         "$regex": "(?i)ö"
       }
     }
   }'
   
   {"docs":[
   {"_id":"1","_rev":"1-96be014c47e090c7705b66c4b646d6f6","data":"xxxöxxx"}
   ],
   "bookmark": "g1AAAAAyeJzLYWBgYMpgSmHgKy5JLCrJTq2MT8lPzkzJBYozGoIkOGASEKEsAEqyDRk",
   "warning": "No matching index found, create an index to optimize query time."}
   ```
   
   ## Expected Behaviour
   
   The query selector should match both documents.
   
   ## Your Environment
   
   * CouchDB version used: 3.1.0, Erlang 20
   * Browser name and version:
   * Operating system and version: Verified on RHEL and Debian
   
   ## Additional Context
   
   Tracing the call to `re:run/3` in Mango reveals that the data string from the 2 documents is treated differently:
   
   ```
   (<0.31742.114>) call mango_selector:match({[{<<"data">>,{[{<<"$regex">>,<<40,63,105,41,195,150>>}]}}]},{[{<<"_id">>,<<"2">>},
     {<<"_rev">>,<<"1-903d0f61be4c2eda02c1bd72d3ba92bc">>},
     {<<"data">>,<<120,120,120,195,150,120,120,120>>}]},#Fun<mango_json.cmp.2>)
   (<0.31742.114>) call mango_selector:match({[{<<"$regex">>,<<40,63,105,41,195,150>>}]},<<120,120,120,195,150,120,120,120>>,#Fun<mango_json.cmp.2>)
   (<0.31742.114>) call re:run(<<120,120,120,195,150,120,120,120>>,<<40,63,105,41,195,150>>,[{capture,none}])
   
   (<0.31706.114>) call mango_selector:match({[{<<"data">>,{[{<<"$regex">>,<<40,63,105,41,195,150>>}]}}]},{[{<<"_id">>,<<"1">>},
     {<<"_rev">>,<<"1-96be014c47e090c7705b66c4b646d6f6">>},
     {<<"data">>,<<"xxxöxxx">>}]},#Fun<mango_json.cmp.2>)
   (<0.31706.114>) call mango_selector:match({[{<<"$regex">>,<<40,63,105,41,195,150>>}]},<<"xxxöxxx">>,#Fun<mango_json.cmp.2>)
   (<0.31706.114>) call re:run(<<"xxxöxxx">>,<<40,63,105,41,195,150>>,[{capture,none}])
   ```
   
   The string `"xxxöxxx"` is represented as `<<"xxxöxxx">>`, but the string `"xxxÖxxx"` is represented as `<<"xxxÖxxx"/utf8>>`. `re:run/3` behaviour differs depending on the string representation (in Erlang 20, at least):
   
   |data|regex|options|result|
   |------|-----|--------|-------|
   |`<<"xxxöxxx">>`|`<<"(?i)ö">>`||`match`|
   |`<<"xxxöxxx"/utf8>>`|`<<"(?i)ö">>`||`nomatch`|
   |`<<"xxxöxxx"/utf8>>`|`<<"(?i)ö"/utf8>>`||`match`|
   |`<<"xxxöxxx">>`|`<<"(?i)ö"/utf8>>`|`unicode`|`nomatch`|
   |`<<"xxxÖxxx"/utf8>>`|`<<"(?i)ö"/utf8>>`|`unicode`|`match`|
   |`<<"xxxÖxxx"/utf8>>`|`<<"(?i)ö"/utf8>>`||`nomatch`|
   |`<<"xxxöxxx">>`|`<<"(?i)ö">>`|`unicode`|`** exception error: bad argument`|
   
   Whether the string is represented as utf8 encoded or not seems to stem all the way from [couch_doc:to_json_obj/2](https://github.com/apache/couchdb/blob/master/src/couch/src/couch_doc.erl#L117) so I wonder if anything else in Mango would be impacted by this.
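   
   The UTF-8 rows of the table can be reproduced without depending on how the shell encodes typed characters by writing the binaries out as explicit bytes. The following is a minimal sketch; the module name `regex_unicode_demo` and the function `run/0` are made up for illustration and are not part of CouchDB or Mango.
   
   ```erlang
   -module(regex_unicode_demo).
   -export([run/0]).
   
   %% Subject and pattern are given as explicit UTF-8 bytes, so the result
   %% does not depend on shell or source-file encoding.
   run() ->
       Lower = <<120,120,120,195,182,120,120,120>>,  % "xxxöxxx" as UTF-8
       Upper = <<120,120,120,195,150,120,120,120>>,  % "xxxÖxxx" as UTF-8
       Pattern = <<40,63,105,41,195,182>>,           % "(?i)ö" as UTF-8
       [
           %% Without `unicode`, re works on bytes rather than code points,
           %% so the two-byte sequences for ö and Ö are not folded together.
           {lower_bytes, re:run(Lower, Pattern, [{capture, none}])},
           {upper_bytes, re:run(Upper, Pattern, [{capture, none}])},
           %% With `unicode`, subject and pattern are decoded as UTF-8 and
           %% (?i) applies to whole characters.
           {lower_unicode, re:run(Lower, Pattern, [unicode, {capture, none}])},
           {upper_unicode, re:run(Upper, Pattern, [unicode, {capture, none}])}
       ].
   ```
   
   Calling `regex_unicode_demo:run()` should give `match`, `nomatch`, `match`, `match` in that order, consistent with the table: only the `unicode` option makes `(?i)` treat `ö` and `Ö` as the same letter.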
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] davisp commented on issue #2895: In CouchDB Query regex with the caseless (?i) modifier is not case-insensitive for unicode strings

Posted by GitBox <gi...@apache.org>.
davisp commented on issue #2895:
URL: https://github.com/apache/couchdb/issues/2895#issuecomment-634188167


   Excellent write up. Unfortunately the Erlang shell and its weirdo UTF-8 behavior caused you more pain than it helped. The issue is that Erlang is displaying binaries and interpreting binaries differently than you might expect.
   
   A quick test to see that in action is to print out the actual binary values to see how things have been interpreted (Running on Erlang 22 locally):
   
   Also for clarity, in UTF-8 `ö` is `195 182` and `Ö` is `195 150`.
   
   ```erlang
   Forms = [
       <<"xxxöxxx">>,
       <<"xxxöxxx"/utf8>>,
       <<"xxxÖxxx">>,
       <<"xxxÖxxx"/utf8>>,
       <<"(?i)ö">>,
       <<"(?i)ö"/utf8>>
   ],
   lists:foreach(fun(F) ->
       io:format("~w~n", [F])
   end, Forms).
   
   <<120,120,120,246,120,120,120>>
   <<120,120,120,195,182,120,120,120>>
   <<120,120,120,214,120,120,120>>
   <<120,120,120,195,150,120,120,120>>
   <<40,63,105,41,246>>
   <<40,63,105,41,195,182>>
   ```
   
   You'll notice that the `/utf8` flag is merely telling the shell to correctly interpret unicode characters that have been typed into the shell. So the table above is just showing this, albeit in a roundabout fashion.
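   
   To see the same thing without typing any non-ASCII characters, the code points can be given as integers (`ö` is 246 and `Ö` is 214). A small sketch; the module name `utf8_bytes_demo` is hypothetical:
   
   ```erlang
   -module(utf8_bytes_demo).
   -export([run/0]).
   
   run() ->
       [
           {raw_lower, <<246>>},        % one byte: the code point stored as-is
           {utf8_lower, <<246/utf8>>},  % two bytes: the code point encoded as UTF-8
           {raw_upper, <<214>>},
           {utf8_upper, <<214/utf8>>},
           %% unicode:characters_to_binary/1 also encodes code points as UTF-8
           {converted, unicode:characters_to_binary([246])}
       ].
   ```
   
   The `/utf8` entries come back as `<<195,182>>` and `<<195,150>>`, the same bytes as in the output above, while the plain entries stay as the single bytes 246 and 214. Printing with `~w` avoids the shell rendering them back as characters.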
   
   Internally, since we know everything has gone through Jiffy, we know everything is valid UTF-8. So it then becomes a question of providing the `unicode` flag to `re:run/3`, and whether or not it's something we should do, since it could be a behavior change.
   
   However, we can also control that flag from the pattern itself, so this works as expected:
   
   ```bash
   curl -u admin:password -X POST http://localhost:5984/test/_find -H "Content-Type: application/json" -d '
   {
     "selector": {
       "data": {
         "$regex": "(*UTF8)(?i)ö"
       }
     }
   }'
   {"docs":[
   {"_id":"1","_rev":"1-96be014c47e090c7705b66c4b646d6f6","data":"xxxöxxx"},
   {"_id":"2","_rev":"1-903d0f61be4c2eda02c1bd72d3ba92bc","data":"xxxÖxxx"}
   ],
   "bookmark": "g1AAAAAyeJzLYWBgYMpgSmHgKy5JLCrJTq2MT8lPzkzJBYozGoEkOGASEKEsAErJDRs",
   "warning": "No matching index found, create an index to optimize query time."}
   
   ```
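   
   The same effect can be checked directly against `re:run/3`, using the option list shown in the trace earlier in this thread (`[{capture,none}]`). This is a hedged sketch with explicit UTF-8 bytes, assuming the PCRE-based `re` of the Erlang releases discussed here (20 and 22); the module name `utf8_prefix_demo` is hypothetical:
   
   ```erlang
   -module(utf8_prefix_demo).
   -export([run/0]).
   
   run() ->
       Upper = <<120,120,120,195,150,120,120,120>>,  % "xxxÖxxx" as UTF-8
       Plain = <<"(?i)", 195, 182>>,                 % "(?i)ö" as UTF-8
       Prefixed = <<"(*UTF8)(?i)", 195, 182>>,       % "(*UTF8)(?i)ö" as UTF-8
       %% Same options in both calls; only the pattern prefix differs.
       {re:run(Upper, Plain, [{capture, none}]),
        re:run(Upper, Prefixed, [{capture, none}])}.
   ```
   
   On those releases `utf8_prefix_demo:run()` should return `{nomatch, match}`: the `(*UTF8)` prefix switches the pattern into UTF-8 mode without any change to the options passed by the caller.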





[GitHub] [couchdb] willholley closed issue #2895: In CouchDB Query regex with the caseless (?i) modifier is not case-insensitive for unicode strings

Posted by GitBox <gi...@apache.org>.
willholley closed issue #2895:
URL: https://github.com/apache/couchdb/issues/2895


   





[GitHub] [couchdb] willholley commented on issue #2895: In CouchDB Query regex with the caseless (?i) modifier is not case-insensitive for unicode strings

Posted by GitBox <gi...@apache.org>.
willholley commented on issue #2895:
URL: https://github.com/apache/couchdb/issues/2895#issuecomment-634526728


   Thanks @davisp. Specifying `(*UTF8)` seems like a reasonable workaround for users that need to match unicode. Closing.





[GitHub] [couchdb] davisp edited a comment on issue #2895: In CouchDB Query regex with the caseless (?i) modifier is not case-insensitive for unicode strings

Posted by GitBox <gi...@apache.org>.
davisp edited a comment on issue #2895:
URL: https://github.com/apache/couchdb/issues/2895#issuecomment-634188167



