You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2021/10/13 06:16:57 UTC

[GitHub] [couchdb] nickva opened a new pull request #3783: Fix reduce view row collation with unicode equivalent keys

nickva opened a new pull request #3783:
URL: https://github.com/apache/couchdb/pull/3783


   Previously, view reduce collation with keys relied on the keys in the rows returned from the view shards to exactly match (=:=) the keys specified in the args. However, in the case when there are multiple rows which compare equal with the unicode collator, that may not always be the case.
   
   In that case when the rows are fetched from the row dict by key, they should be matched using the same collation algorithm as the one used on the view shards. Since, the collation module we use doesn't export an equivalence function, we mimic it by using the `(not A < B) and (not B < A)` logic. This relies on the subtle fact that all the things we pass into the less/2 function have a defined comparison order. Another option is to use the Erlang standard library `string:equal(A, B, false, nfd)` function, which is more direct. However, it could eventually hit a case when the Erlang standard libraries NFD equivalence algorithm might not agree with our collation library's (libicu) idea of equivalence.
   
   Issue: https://github.com/apache/couchdb/issues/3773


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] jcoglan commented on pull request #3783: Fix reduce view row collation with unicode equivalent keys

Posted by GitBox <gi...@apache.org>.
jcoglan commented on pull request #3783:
URL: https://github.com/apache/couchdb/pull/3783#issuecomment-949635068


   What behaviour does this produce for the examples in #3773? In that issue we have two stored view rows with keys that have different bytes, representing different codepoints, but which can be viewed equal under Unicode normalisation.
   
   - one key (call this key A) has bytes `c3 ae` representing codepoints U+00EE
   - the other (key B) has bytes `69 cc 82` representing codepoints U+0069 U+0302
   
   When these keys are retrieved from the same shard they're considered equal when reducing, but not when fetched from different shards. This meant that:
   
   - with `q=1`, querying the view with no filter gives 1 row. Querying for key A gives 1 row and key B gives 0 rows.
   - with `q=2`, querying the view with no filter gives 2 rows. Querying for key A gives 1 row and key B also gives 1 row.
   
   What behaviour does this patch give for these scenarios? Would it be possible to write tests for them?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] nickva edited a comment on pull request #3783: Fix reduce view row collation with unicode equivalent keys

Posted by GitBox <gi...@apache.org>.
nickva edited a comment on pull request #3783:
URL: https://github.com/apache/couchdb/pull/3783#issuecomment-949932469


   > if CouchDB has considered those strings to be equal but doesn't normalise them in its results.
   
   CouchDB currently doesn't normalize json keys in the views, neither when updating the view or the start/end keys or key dicts when querying. Perhaps we should do it, but I think that's a larger decision to be had, as it would involve compatibility with existent views.
   
   CouchDB relies on unicode comparisons only (`less(A,B) -> -1 | 0 | 1`). That function is used on each shard as base BTree ordering function, and used in the coordinator (fabric) when merging results. The primary issue that previously there was a subtle difference in how the functions worked on each shard vs how it works in the coordinator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] nickva commented on pull request #3783: Fix reduce view row collation with unicode equivalent keys

Posted by GitBox <gi...@apache.org>.
nickva commented on pull request #3783:
URL: https://github.com/apache/couchdb/pull/3783#issuecomment-949932469


   > if CouchDB has considered those strings to be equal but doesn't normalise them in its results.
   
   CouchDB currently doesn't normalize json keys in the views, neither when updating the view or the start/end keys or key dicts when querying. Perhaps we should do it, but I think that's a larger decision to be made, as it would involve compatibility with existent views.
   
   CouchDB relies on unicode comparison only (`less(A,B) -> -1 | 0 | 1`). That function is used on each shard as base BTree ordering function, and used in the coordinator (fabric) when merging results. The primary issue that previously there was a subtle difference in how the functions worked on each shard vs how it works in the coordinator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] jcoglan commented on pull request #3783: Fix reduce view row collation with unicode equivalent keys

Posted by GitBox <gi...@apache.org>.
jcoglan commented on pull request #3783:
URL: https://github.com/apache/couchdb/pull/3783#issuecomment-949638061


   For clarity I'm referring to the queries performed by these requests:
   
   ```
   curl -s "$COUCH_URL/debug/_design/by-type-name/_view/by-type-name?group=true"
   
   curl -s "$COUCH_URL/debug/_design/by-type-name/_view/by-type-name?group=true" -H "Content-Type: application/json" -d '{"keys": [["file", "chaîne"]]}'
   
   curl -s "$COUCH_URL/debug/_design/by-type-name/_view/by-type-name?group=true" -H "Content-Type: application/json" -d '{"keys": [["file", "chaîne"]]}'
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] janl merged pull request #3783: Fix reduce view row collation with unicode equivalent keys

Posted by GitBox <gi...@apache.org>.
janl merged pull request #3783:
URL: https://github.com/apache/couchdb/pull/3783


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] janl merged pull request #3783: Fix reduce view row collation with unicode equivalent keys

Posted by GitBox <gi...@apache.org>.
janl merged pull request #3783:
URL: https://github.com/apache/couchdb/pull/3783


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] nono commented on pull request #3783: Fix reduce view row collation with unicode equivalent keys

Posted by GitBox <gi...@apache.org>.
nono commented on pull request #3783:
URL: https://github.com/apache/couchdb/pull/3783#issuecomment-949423630


   @jcoglan what do you think of this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] jcoglan commented on pull request #3783: Fix reduce view row collation with unicode equivalent keys

Posted by GitBox <gi...@apache.org>.
jcoglan commented on pull request #3783:
URL: https://github.com/apache/couchdb/pull/3783#issuecomment-949660910


   It looks like https://github.com/apache/couchdb/issues/3773#issuecomment-941963658 is consistent with the first case I mention above. Do we mind that the results contain unnormalised keys? In @nickva's example:
   
   ```
   --- c h a i n e ---
   {"rows":[
   {"key":["file","chaîne"],"value":2}
   ]}
   
   --- c h a i ^ n e ---
   {"rows":[
   {"key":["file","chaîne"],"value":2}
   ]}
   ```
   
   The first result has `c3 ae` and the second has `69 cc 82` in the `"key"` field. I'm wondering if applications could get confused by this, if CouchDB has considered those strings to be equal but doesn't normalise them in its results. It's quite possible we should leave those bytes alone though and not transform strings passed to us by the application.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] jcoglan commented on pull request #3783: Fix reduce view row collation with unicode equivalent keys

Posted by GitBox <gi...@apache.org>.
jcoglan commented on pull request #3783:
URL: https://github.com/apache/couchdb/pull/3783#issuecomment-950657897


   @nickva I think you're right, normalising keys in output could be a significant behaviour change so something that requires more thought and planning, not something to roll into this fix 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] jcoglan commented on pull request #3783: Fix reduce view row collation with unicode equivalent keys

Posted by GitBox <gi...@apache.org>.
jcoglan commented on pull request #3783:
URL: https://github.com/apache/couchdb/pull/3783#issuecomment-949655148


   Seems to me that we either want:
   
   - the keys are considered equal, in which case all three queries return a single row with value `2`
   - the keys are considered different, in which case the first query returns two rows with value `1`, and the latter two queries return one row with value `1`
   
   And we want these results to be independent of `q`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] nickva edited a comment on pull request #3783: Fix reduce view row collation with unicode equivalent keys

Posted by GitBox <gi...@apache.org>.
nickva edited a comment on pull request #3783:
URL: https://github.com/apache/couchdb/pull/3783#issuecomment-949932469


   > if CouchDB has considered those strings to be equal but doesn't normalise them in its results.
   
   CouchDB currently doesn't normalize json keys in the views, neither when updating the view or the start/end keys or key dicts when querying. Perhaps we should do it, but I think that's a larger decision to be made, as it would involve compatibility with existent views.
   
   CouchDB relies on unicode comparisons only (`less(A,B) -> -1 | 0 | 1`). That function is used on each shard as base BTree ordering function, and used in the coordinator (fabric) when merging results. The primary issue that previously there was a subtle difference in how the functions worked on each shard vs how it works in the coordinator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] nickva edited a comment on pull request #3783: Fix reduce view row collation with unicode equivalent keys

Posted by GitBox <gi...@apache.org>.
nickva edited a comment on pull request #3783:
URL: https://github.com/apache/couchdb/pull/3783#issuecomment-949932469


   > if CouchDB has considered those strings to be equal but doesn't normalise them in its results.
   
   CouchDB currently doesn't normalize json keys in the views, neither when updating the view or the start/end keys or key dicts when querying. Perhaps we should do it, but I think that's a larger decision to be made, as it would involve compatibility with existent views.
   
   CouchDB relies on unicode comparisons only (`less(A,B) -> -1 | 0 | 1`). That function is used on each shard as base BTree ordering function, and used in the coordinator (fabric) when merging results. The primary issue that previously there was a subtle difference in how the functions worked on each shard vs how it works in the coordinator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org