Posted to dev@couchdb.apache.org by Nick Vatamaniuc <va...@apache.org> on 2021/10/29 23:01:02 UTC

[DISCUSS] Handle libicu upgrades better

Hello everyone,

CouchDB by default uses the libicu library to sort its view rows.
When views are built, we do not record or track the version of the
collation algorithm. The issue is that the ICU library may modify the
collation order between major libicu versions, and when that happens,
views built with the older version may experience data loss. I wanted
to discuss the option of recording the libicu collator version in each
view and warning the user when there is a mismatch, with the choice to
either ignore the mismatch or automatically rebuild the views.

Imagine, for example, searching patient records using start/end keys.
It could be possible that, say, the first letter of their name now
collates differently in a new libicu. That would prevent the patient
record from showing up in the view results for some important
procedure or medication. Users might not even be aware of this kind of
data loss occurring, since there won't be any error in the API or
warning in the logs.
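
To make the failure mode concrete, here is a toy Python sketch (not
CouchDB code). It assumes two made-up collation orders for the letters
a, ä and b, standing in for an old and a new libicu, and a sorted list
standing in for a view b-tree. The names OLD_ORDER, NEW_ORDER,
collation_key and range_query are invented for this illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical collation orders for three letters; real ICU changes are subtler.
OLD_ORDER = {"a": 0, "ä": 1, "b": 2}
NEW_ORDER = {"a": 0, "b": 1, "ä": 2}


def collation_key(order):
    """Build a sort-key function from a letter -> rank table."""
    return lambda s: tuple(order[ch] for ch in s)


def range_query(rows, startkey, endkey, key):
    """Binary-search a sorted list, similar in spirit to a start/end key lookup.

    The search trusts that `rows` is already sorted under `key`.
    """
    keys = [key(r) for r in rows]
    lo = bisect_left(keys, key(startkey))
    hi = bisect_right(keys, key(endkey))
    return rows[lo:hi]


old_key = collation_key(OLD_ORDER)
new_key = collation_key(NEW_ORDER)

# The "view" was built, and therefore sorted, with the old collation.
view_rows = sorted(["ab", "äb", "ba", "bb"], key=old_key)

print(range_query(view_rows, "ä", "äb", old_key))  # ['äb']
print(range_query(view_rows, "ä", "äb", new_key))  # [] (the row is silently missed)
```

The second query raises no error; the binary search simply looks in the
wrong place because the stored order no longer matches the comparator.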

I have been thinking about how to solve this. A few commits have
already cleaned up our collation drivers [1], exposed the libicu and
collation algorithm versions in the new _versions endpoint [2], and
made some other minor fixes in that area. As next steps we could:

  1) Modify our views to keep track of the collation algorithm
version. We could attempt to transparently upgrade the view header
format: read the old view file, add a libicu collation version field
to the header (which changes the signature), and then save the file
with the new header and new signature. This avoids view rebuilds; it
just records the collator version in the view and moves the files to
a new name.

  2) Do what PostgreSQL does [3]: 2a) emit a warning with the view
results when the current libicu version doesn't match the version in
the view. That means altering the view results to add a "warning":
"..." field. An alternative, 2b), is to emit a warning in
_design/$ddoc/_info only. Users would then have to know to check
_design/$ddoc/_info for each ddoc in each db after an OS version
upgrade or after restoring backups. Of course, there may be users who
use the "raw" collation option, or who know they are using only plain
ASCII in their views, so we'd have a configuration setting to ignore
the warnings as well.

  3) Users who see the warning could then either rebuild the view
manually with the new collator library, or it could happen
automatically based on a configuration option, basically "when
collator versions are mismatched, invalidate and rebuild all the
views".

  4) We'd have a way for users to assert (POST a ddoc update) that
they double-checked the new ICU version and are convinced that a
particular view would not experience data loss with the new collator.
That should make the warning go away and keep the view from being
rebuilt. This can't be just a naive "collator" option setting, as both
per-view and per-design options are used when computing the view
signature, and any changes there would result in the view being
rebuilt. Perhaps we can add it to the design docs as a separate option
which is excluded from the signature hash, like the "autoupdate"
setting for the background index builder
("collation_version_accept"?). PostgreSQL also offers this option with
the ALTER COLLATION ... REFRESH VERSION command [3].
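
As a rough sketch of how steps 1) to 4) could fit together, the Python
below models the decision made when a view is opened. It is only an
illustration under stated assumptions, not a proposed implementation:
the ViewState and Config types, the ignore_mismatch and auto_rebuild
settings, and the decide function are all hypothetical names, and
accepted_version stands in for the tentative "collation_version_accept"
option.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    SERVE = "serve"      # serve results with no warning
    WARN = "warn"        # serve results but attach a warning
    REBUILD = "rebuild"  # invalidate and rebuild the view


@dataclass
class ViewState:
    collation: str                          # "default" (ICU) or "raw"
    recorded_version: str                   # collator version stored in the view header
    accepted_version: Optional[str] = None  # hypothetical user assertion from the ddoc


@dataclass
class Config:
    ignore_mismatch: bool = False  # hypothetical setting for step 2
    auto_rebuild: bool = False     # hypothetical setting for step 3


def decide(view: ViewState, current_version: str, config: Config) -> Action:
    """Pick an action for a view given the collator version currently loaded."""
    if view.collation == "raw":
        return Action.SERVE        # raw collation does not depend on ICU
    if view.recorded_version == current_version:
        return Action.SERVE        # built with the same collator
    if view.accepted_version == current_version:
        return Action.SERVE        # step 4: user asserted this version is fine
    if config.ignore_mismatch:
        return Action.SERVE        # step 2: warnings switched off
    if config.auto_rebuild:
        return Action.REBUILD      # step 3: automatic rebuild
    return Action.WARN             # step 2: default is to warn


stale = ViewState(collation="default", recorded_version="153.14")
print(decide(stale, "153.112", Config()))                   # Action.WARN
print(decide(stale, "153.112", Config(auto_rebuild=True)))  # Action.REBUILD
```

The version strings are placeholders. The order of the checks is a
design choice in this sketch: an explicit per-view assertion wins over
the global rebuild setting, so an accepted view is never rebuilt.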

What do we think, is this a reasonable approach? Is there something
easier / simpler we can do?

Thanks!
-Nick

[1] https://github.com/apache/couchdb/pull/3746/commits/28f26f52fe2e170d98658311dafa8198d96b8061
[2] https://github.com/apache/couchdb/commit/c1bb4e4856edd93255d75ebe158b4da38bbf3333
[3] https://www.postgresql.org/docs/13/sql-altercollation.html

Re: [DISCUSS] Handle libicu upgrades better

Posted by Nick Vatamaniuc <va...@gmail.com>.
Interesting idea, Will, about possibly using the collation functions
from the query server side. There is no current mechanism to do it;
we'd have to invent it.

If we could reliably detect the couchjs libicu library version, we
could try to track it separately from the libicu used to sort view
keys. But I don't think it's exposed as a JS library call (like we have
for get_libicu_version in the NIF module). And if we did track it and
there was a version mismatch, we wouldn't even be able to use the
trick from above to recompact the view; we'd have to fully reset it.

I noticed in your link that there is a mode to disable libicu linking,
'--without-intl-api', which turns off some APIs on the JS side. One
way to ensure we don't need to track libicu versions linked to the
collator is to disable its usage :-) At first it seems rather
unusual, but it could provide some stability guarantee that views
won't become invalid after couchjs is upgraded. (There is of course
the chance that other APIs users rely on generate different data on
a newer JS engine, which would also invalidate the old views. It
wouldn't have to be libicu; it could be, say, some math operations or
other string processing functions.)

Perhaps we should just track couchjs versions and engine types in the
view file headers like we're starting to do with libicu versions? I
feel like we might need that at some point, but it also feels like a
future effort, since we'd have to handle full view resets, warnings,
user assertions about their view / JS engine being compatible, etc.

Cheers,
-Nick
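
The recompaction "trick" mentioned above rests on the observation, made
elsewhere in this thread, that collation affects only the order of
rows, not the emitted keys and values. A toy Python sketch of that idea
follows. It is not how the CouchDB compactor is implemented, and the
collation tables and the recompact function are invented for
illustration. The map function is never called again: the existing
rows are simply re-sorted under the new collation.

```python
# Hypothetical collation orders standing in for an old and a new libicu.
OLD_ORDER = {"a": 0, "ä": 1, "b": 2}
NEW_ORDER = {"a": 0, "b": 1, "ä": 2}


def collation_key(order):
    """Build a sort-key function from a letter -> rank table."""
    return lambda s: tuple(order[ch] for ch in s)


def recompact(rows, key):
    """Return the same (key, value) rows re-sorted under a new collation key."""
    return sorted(rows, key=lambda row: key(row[0]))


old_key = collation_key(OLD_ORDER)
new_key = collation_key(NEW_ORDER)

# Rows as emitted by a map function and stored in old collation order.
old_rows = sorted([("ab", 1), ("äb", 2), ("ba", 3), ("bb", 4)],
                  key=lambda row: old_key(row[0]))
new_rows = recompact(old_rows, new_key)

print([k for k, _ in old_rows])  # ['ab', 'äb', 'ba', 'bb']
print([k for k, _ in new_rows])  # ['ab', 'ba', 'bb', 'äb']
```

This only works when the emitted data itself is unchanged. If the JS
engine produced different keys or values, as discussed above, re-sorting
would not be enough and a full reset would be needed.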

On Thu, Jan 13, 2022 at 12:14 PM Will Young <lo...@gmail.com> wrote:
>
> I would be a little hesitant to rely on the version Mozilla bundles
> working reliably, since they modify it a little and don't actually
> use its normal build system:
> https://firefox-source-docs.mozilla.org/intl/icu.html#internationalization-in-spidermonkey-and-gecko
>
> It is frustrating that icu->spidermonkey->couch creates the longest
> and most fragile dependency chain when doing a full Windows build,
> especially since it's not really clear whether anyone makes use of
> Intl, String.prototype.normalize() or similar functionality in
> SpiderMonkey. If C functions to access things from ICU were
> explicitly registered in the JS context from the query server code,
> that could break this dependency chain: we could disable Intl and
> know explicitly what to do when ICU really should be used. That would
> also make it easier to replace SpiderMonkey with a more minimalist JS
> interpreter.
>
> -Will
>
> On Thu, Jan 13, 2022 at 5:38 PM Nick Vatamaniuc <va...@gmail.com> wrote:
> >
> > Hi Ronny,
> >
> > If it makes it easier to build on some platforms it could make sense.
> > Or find some way for both of them to point to a single libicu library.
> >
> > On some OSes (e.g. Linux distros), dynamically linking to a system
> > libicu also makes sense because it's often the easiest way to get
> > security updates. libicu has had quite a number of high-risk CVEs
> > over the years [1].
> >
> > [1] https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=icu
> >
> > -Nick
> >
> > On Wed, Jan 12, 2022 at 2:47 PM Ronny Berndt
> > <ro...@kioskkinder.com.invalid> wrote:
> > >
> > > Hi,
> > >
> > > > To avoid ending up with different versions of the ICU libs, why not use the version
> > > > shipped in the SpiderMonkey tree (ESR versions only), link against it in the build process,
> > > > and stop relying on the system version?
> > > >
> > > > The Windows build of CouchDB isn't available for the current version, and the build process for
> > > > this OS is stuck at the moment. Maybe this is a broader discussion, and it might be a good idea to
> > > > combine it with the Erlang version update process ([DISCUSS] Erlang version update process for convenience binaries).
> > >
> > > - Ronny
> > >
> > > > > On Jan 12, 2022, at 4:31 PM, Will Young <lo...@gmail.com> wrote:
> > > >
> > > > Hi Nick,
> > > >
> > > > >  I like the way this breaks down the problem into something that can
> > > > > work with the existing maintenance mechanisms. On the UCA version, it
> > > > > looks to me like the major version tracks the last Unicode version
> > > > > that had a collation change (version 9.0?), while the ICU version
> > > > > changes with each release, which is more frequent than actual
> > > > > collation changes. Looking at the ICU release notes, I get the
> > > > > impression that the real frequency of change may be somewhere in
> > > > > between, because of bug fixes or additions to Unicode that directly
> > > > > get a differing order in the root collation. For example, ICU 54
> > > > > seems like a clean match of UCA version and collation change, while
> > > > > it seems like 59 could have changed some emoji sort orders that may
> > > > > already have been reflected in 58's UCA version?
> > > > >
> > > > > Another question I have about ICU synchronization is SpiderMonkey's
> > > > > use of ICU. Since all build instructions keep Erlang and mozjs
> > > > > linking to the same system ICU, I think there could never be a need to
> > > > > record an ICU-related version from the query server. But I've never
> > > > > seen instructions to set locales in relation to the query server or to
> > > > > do anything to ensure a function is using the root collator, so I
> > > > > don't think the build setup reflects an actual need for SpiderMonkey
> > > > > to be truly in sync on aspects of ICU like collation setup, and
> > > > > everything important is happening in the Erlang NIFs?
> > > > Thanks,
> > > > -Will
> > > >
> > > > > On Tue, Jan 11, 2022 at 7:06 PM Nick Vatamaniuc <va...@gmail.com> wrote:
> > > >>
> > > >> I threw together a draft PR https://github.com/apache/couchdb/pull/3889
> > > >>
> > > > >> Would that work? There are two tricks there. The first is re-using a
> > > > >> field position from an older (<2.3.1) format, which should allow
> > > > >> transparently downgrading back to 3.2.1, as we ignore that field
> > > > >> there. The second is using a map, so it should allow adding extra
> > > > >> info to the views in the future (custom collation tailorings?).
> > > >>
> > > >> Thanks,
> > > >> -Nick
> > > >>
> > > >> On Sat, Nov 20, 2021 at 12:32 PM Nick Vatamaniuc <va...@gmail.com> wrote:
> > > >>>
> > > >>> Thanks, Adam. And thanks for the tip about the view header, Bob.
> > > >>>
> > > > >>> I wonder if a disk version would make sense for views. I noticed Eric
> > > > >>> did a nice job transparently migrating 2.x -> 3.x view files when we
> > > > >>> removed key seq indices. Perhaps something like that would work for
> > > > >>> adding a collator version.
> > > >>>
> > > >>> Cheers,
> > > >>> -Nick
> > > >>>
> > > >>> On Fri, Nov 19, 2021 at 9:09 AM Adam Kocoloski <ko...@apache.org> wrote:
> > > >>>>
> > > >>>> That seems like a smart solution Nick.
> > > >>>>
> > > >>>> Adam
> > > >>>>
> > > >>>>> On Nov 19, 2021, at 7:28 AM, Robert Newson <bo...@rsn.io> wrote:
> > > >>>>>
> > > >>>>> Noting that the upgrade channel for views was misconceived (by me) as there is no version number in the header for them. You’d need to add it.
> > > >>>>>
> > > >>>>> B.
> > > >>>>>
> > > >>>>>> On 18 Nov 2021, at 07:12, Nick Vatamaniuc <va...@gmail.com> wrote:
> > > >>>>>>
> > > >>>>>> Thinking more about this issue I wonder if we can avoid resetting and
> > > >>>>>> rebuilding everything from scratch, and instead, let the upgrade
> > > >>>>>> happen in the background, while still serving the existing view data.
> > > >>>>>>
> > > >>>>>> The realization was that collation doesn't affect the emitted keys and
> > > >>>>>> values themselves, only their order in the view b-trees. That means
> > > >>>>>> we'd just have to rebuild b-trees, and that is exactly what our view
> > > >>>>>> compactor already does.
> > > >>>>>>
> > > > >>>>>> When we detect a libicu version discrepancy, we'd submit the view for
> > > > >>>>>> compaction. We even have a dedicated "upgrade" [1] channel in smoosh
> > > > >>>>>> which handles file format version upgrades, and we'd tweak that logic
> > > > >>>>>> to trigger on libicu version mismatches as well.
> > > >>>>>>
> > > >>>>>> Would this work? Does anyone see any issue with that approach?
> > > >>>>>>
> > > >>>>>> [1] https://github.com/apache/couchdb/blob/3.x/src/smoosh/src/smoosh_server.erl#L435-L442
> > > >>>>>>
> > > >>>>>> Cheers,
> > > >>>>>> -Nick
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>> On Fri, Oct 29, 2021 at 7:01 PM Nick Vatamaniuc <va...@apache.org> wrote:
> > > >>>>>>>
> > > >>>>>>> Hello everyone,
> > > >>>>>>>
> > > >>>>>>> CouchDB by default uses the libicu library to sort its view rows.
> > > >>>>>>> When views are built, we do not record or track the version of the
> > > >>>>>>> collation algorithm. The issue is that the ICU library may modify the
> > > >>>>>>> collation order between major libicu versions, and when that happens,
> > > >>>>>>> views built with the older versions may experience data loss. I wanted
> > > >>>>>>> to discuss the option to record the libicu collator version in each
> > > >>>>>>> view then warn the user when there is a mismatch. Also, optionally
> > > >>>>>>> ignore the mismatch, or automatically rebuild the views.
> > > >>>>>>>
> > > >>>>>>> Imagine, for example, searching patient records using start/end keys.
> > > >>>>>>> It could be possible that, say, the first letter of their name now
> > > >>>>>>> collates differently in a new libicu. That would prevent the patient
> > > >>>>>>> record from showing up in the view results for some important
> > > >>>>>>> procedure or medication. Users might not even be aware of this kind of
> > > >>>>>>> data loss occurring, there won't be any error in the API or warning in
> > > >>>>>>> the logs.
> > > >>>>>>>
> > > >>>>>>> I was thinking how to solve this. There were a few commits already to
> > > >>>>>>> cleanup our collation drivers [1], expose libicu and collation
> > > >>>>>>> algorithm version in the new _versions endpoint [2], and some other
> > > >>>>>>> minor fixes in that area. As the next steps we could:
> > > >>>>>>>
> > > >>>>>>> 1) Modify our views to keep track of the collation algorithm
> > > >>>>>>> version. We could attempt to transparently upgrade the view header
> > > >>>>>>> format -- read the old view file, update the header with an extra
> > > >>>>>>> libicu collation version field, that updates the signature, and then,
> > > >>>>>>> save the file with the new header and new signature. This avoids view
> > > >>>>>>> rebuilds, just records the collator version in the view and moves the
> > > >>>>>>> files to a new name.
> > > >>>>>>>
> > > >>>>>>> 2) Do what PostgreSQL does, and 2a) emit a warning with the view
> > > >>>>>>> results when the current libicu version doesn't match the version in
> > > >>>>>>> the view [3]. That means altering the view results to add a "warning":
> > > >>>>>>> "..." field. Another alternative 2b) is emit a warning in the
> > > >>>>>>> _design/$ddoc/_info only. Users would have to know that after an OS
> > > >>>>>>> version upgrade, or restoring backups, to make sure to look at their
> > > >>>>>>> _design/$ddoc/_info for each db for each ddoc. Of course, there may be
> > > >>>>>>> users which used the "raw" collation option, or know they are using
> > > >>>>>>> just the plain ASCII character sets in their views. So we'd have a
> > > >>>>>>> configuration setting to ignore the warnings as well.
> > > >>>>>>>
> > > >>>>>>> 3) Users who see the warning, could then either rebuild the view
> > > >>>>>>> with the new collator library manually, or it could happen
> > > >>>>>>> automatically based on a configuration option, basically "when
> > > >>>>>>> collator versions are miss-matched, invalidate and rebuild all the
> > > >>>>>>> views".
> > > >>>>>>>
> > > >>>>>>> 4) We'd have a way for the users to assert (POST a ddoc update) that
> > > >>>>>>> they double-checked the new ICU version and are convinced that a
> > > >>>>>>> particular view would not experience data loss with the new collator.
> > > >>>>>>> That should make the warning go away, and the view to not be rebuilt.
> > > >>>>>>> This can't be just a naive "collator" option setting as both per-view
> > > >>>>>>> and per-design options are used when computing the view signature, and
> > > >>>>>>> any changes there would result in the view being rebuilt. Perhaps we
> > > >>>>>>> can add it to the design docs as a separate option which is excluded
> > > >>>>>>> from the signature hash, like the "autoupdate" setting for background
> > > >>>>>>> index builder ("collation_version_accept"?). PostgreSQL also offers
> > > >>>>>>> this option with the ALTER COLLATION ... REFRESH VERSION command [3]
> > > >>>>>>>
> > > >>>>>>> What do we think, is this a reasonable approach? Is there something
> > > >>>>>>> easier / simpler we can do?
> > > >>>>>>>
> > > >>>>>>> Thanks!
> > > >>>>>>> -Nick
> > > >>>>>>>
> > > >>>>>>> [1] https://github.com/apache/couchdb/pull/3746/commits/28f26f52fe2e170d98658311dafa8198d96b8061
> > > >>>>>>> [2] https://github.com/apache/couchdb/commit/c1bb4e4856edd93255d75ebe158b4da38bbf3333
> > > >>>>>>> [3] https://www.postgresql.org/docs/13/sql-altercollation.html
> > > >>>>>
> > > >>>>
> > >

Re: [DISCUSS] Handle libicu upgrades better

Posted by Will Young <lo...@gmail.com>.
I would be a little hesitant to rely on the version Mozilla bundles
working reliably, since they modify it a little and don't actually
use its normal build system:
https://firefox-source-docs.mozilla.org/intl/icu.html#internationalization-in-spidermonkey-and-gecko

It is frustrating that icu->spidermonkey->couch creates the longest
and most fragile dependency chain when doing a full Windows build,
especially since it's not really clear whether anyone makes use of
Intl, String.prototype.normalize() or similar functionality in
SpiderMonkey. If C functions to access things from ICU were
explicitly registered in the JS context from the query server code,
that could break this dependency chain: we could disable Intl and
know explicitly what to do when ICU really should be used. That would
also make it easier to replace SpiderMonkey with a more minimalist JS
interpreter.

-Will


Re: [DISCUSS] Handle libicu upgrades better

Posted by Nick Vatamaniuc <va...@gmail.com>.
Hi Ronny,

If it makes it easier to build on some platforms it could make sense.
Or find some way for both of them to point to a single libicu library.

On some OSes (e.g. Linux distros), dynamically linking to a system
libicu also makes sense because it's often the easiest way to get
security updates. libicu has had quite a number of high-risk CVEs
over the years [1].

[1] https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=icu

-Nick

> >>>>>>> On Fri, Oct 29, 2021 at 7:01 PM Nick Vatamaniuc <va...@apache.org> wrote:
> >>>>>>>
> >>>>>>> Hello everyone,
> >>>>>>>
> >>>>>>> CouchDB by default uses the libicu library to sort its view rows.
> >>>>>>> When views are built, we do not record or track the version of the
> >>>>>>> collation algorithm. The issue is that the ICU library may modify the
> >>>>>>> collation order between major libicu versions, and when that happens,
> >>>>>>> views built with the older versions may experience data loss. I wanted
> >>>>>>> to discuss the option to record the libicu collator version in each
> >>>>>>> view then warn the user when there is a mismatch. Also, optionally
> >>>>>>> ignore the mismatch, or automatically rebuild the views.
> >>>>>>>
> >>>>>>> Imagine, for example, searching patient records using start/end keys.
> >>>>>>> It could be possible that, say, the first letter of their name now
> >>>>>>> collates differently in a new libicu. That would prevent the patient
> >>>>>>> record from showing up in the view results for some important
> >>>>>>> procedure or medication. Users might not even be aware of this kind of
> >>>>>>> data loss occurring, there won't be any error in the API or warning in
> >>>>>>> the logs.
> >>>>>>>
> >>>>>>> I was thinking how to solve this. There were a few commits already to
> >>>>>>> cleanup our collation drivers [1], expose libicu and collation
> >>>>>>> algorithm version in the new _versions endpoint [2], and some other
> >>>>>>> minor fixes in that area. As the next steps we could:
> >>>>>>>
> >>>>>>> 1) Modify our views to keep track of the collation algorithm
> >>>>>>> version. We could attempt to transparently upgrade the view header
> >>>>>>> format -- read the old view file, update the header with an extra
> >>>>>>> libicu collation version field, that updates the signature, and then,
> >>>>>>> save the file with the new header and new signature. This avoids view
> >>>>>>> rebuilds, just records the collator version in the view and moves the
> >>>>>>> files to a new name.
> >>>>>>>
> >>>>>>> 2) Do what PostgreSQL does, and 2a) emit a warning with the view
> >>>>>>> results when the current libicu version doesn't match the version in
> >>>>>>> the view [3]. That means altering the view results to add a "warning":
> >>>>>>> "..." field. Another alternative 2b) is emit a warning in the
> >>>>>>> _design/$ddoc/_info only. Users would have to know that after an OS
> >>>>>>> version upgrade, or restoring backups, to make sure to look at their
> >>>>>>> _design/$ddoc/_info for each db for each ddoc. Of course, there may be
> >>>>>>> users which used the "raw" collation option, or know they are using
> >>>>>>> just the plain ASCII character sets in their views. So we'd have a
> >>>>>>> configuration setting to ignore the warnings as well.
> >>>>>>>
> >>>>>>> 3) Users who see the warning, could then either rebuild the view
> >>>>>>> with the new collator library manually, or it could happen
> >>>>>>> automatically based on a configuration option, basically "when
> > >>>>>>> collator versions are mismatched, invalidate and rebuild all the
> >>>>>>> views".
> >>>>>>>
> >>>>>>> 4) We'd have a way for the users to assert (POST a ddoc update) that
> >>>>>>> they double-checked the new ICU version and are convinced that a
> >>>>>>> particular view would not experience data loss with the new collator.
> >>>>>>> That should make the warning go away, and the view to not be rebuilt.
> >>>>>>> This can't be just a naive "collator" option setting as both per-view
> >>>>>>> and per-design options are used when computing the view signature, and
> >>>>>>> any changes there would result in the view being rebuilt. Perhaps we
> >>>>>>> can add it to the design docs as a separate option which is excluded
> >>>>>>> from the signature hash, like the "autoupdate" setting for background
> >>>>>>> index builder ("collation_version_accept"?). PostgreSQL also offers
> >>>>>>> this option with the ALTER COLLATION ... REFRESH VERSION command [3]
> >>>>>>>
> >>>>>>> What do we think, is this a reasonable approach? Is there something
> >>>>>>> easier / simpler we can do?
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>> -Nick
> >>>>>>>
> >>>>>>> [1] https://github.com/apache/couchdb/pull/3746/commits/28f26f52fe2e170d98658311dafa8198d96b8061
> >>>>>>> [2] https://github.com/apache/couchdb/commit/c1bb4e4856edd93255d75ebe158b4da38bbf3333
> >>>>>>> [3] https://www.postgresql.org/docs/13/sql-altercollation.html
> >>>>>
> >>>>
>

Re: [DISCUSS] Handle libicu upgrades better

Posted by Ronny Berndt <ro...@kioskkinder.com.INVALID>.
Hi,

To avoid ending up with different versions of the ICU libs, why not use the version of the libs
shipped in the SpiderMonkey tree (ESR versions only), link against those in the build process,
and not rely on the system version?

The Windows build of CouchDB isn’t available for the current version, and the build process for this
OS is stuck at the moment. This may be a broader discussion, and it may be a good idea to combine
it with the Erlang version update process ([DISCUSS] Erlang version update process for convenience binaries).

- Ronny

> On Jan 12, 2022, at 16:31, Will Young <lo...@gmail.com> wrote:
> 
> Hi Nick,
> 
>  I like the way this breaks down the problem into something that can
> work with the existing maintenance mechanisms. On the UCA version it
> looks to me like the major version tracks the last unicode version
> that had a collation change (version 9.0?), while the ICU version is
> changing with each release which would be more frequent than actual
> collation changes. Looking at the ICU release notes I get the
> impression that the frequency of change may be in between because of bug
> fixes or additions to unicode that directly get a differing order in
> the root collation. I.e. ICU 54 seems like a clean match of UCA
> version and collation change while it seems like 59 could have changed
> some emoji sort orders that may already have been reflected in 58's
> UCA version?
> 
> Another question I have about ICU synchronization is spidermonkey's
> use of ICU. Since all build instructions keep erlang and mozjs'
> linking to the same system ICU, I think there could never be a need to
> record an ICU related version from the query server, but I've never
> seen instructions to set locales in relation to the query server or do
> anything to ensure a function is using the root collator, so I don't
> think the build setup reflects an actual need for spidermonkey to be
> truly in sync on aspects of icu like collation setup and everything
> important is happening in the erlang/nifs?
> Thanks,
> -Will
> 
> On Tue, Jan 11, 2022 at 19:06, Nick Vatamaniuc <va...@gmail.com> wrote:
>> 
>> I threw together a draft PR https://github.com/apache/couchdb/pull/3889
>> 
>> Would that work? There are two tricks there - re-using a field
>> position from an older <2.3.1 format, this should allow transparently
>> downgrading back to 3.2.1 as we ignore that field there. Also, used a
>> map  so it should allow adding extra info to the views in the future
>> (custom collation tailorings?).
>> 
>> Thanks,
>> -Nick
>> 
>> On Sat, Nov 20, 2021 at 12:32 PM Nick Vatamaniuc <va...@gmail.com> wrote:
>>> 
>>> Thanks, Adam. And thanks for the tip about the view header, Bob.
>>> 
>>> Wonder if a disk version would make sense for views. Noticed Eric did
>>> a nice job transparently migrating 2.x -> 3.x view files when we
>>> removed key seq indices. Perhaps something like that would work for
>>> adding a collator version.
>>> 
>>> Cheers,
>>> -Nick
>>> 
>>> On Fri, Nov 19, 2021 at 9:09 AM Adam Kocoloski <ko...@apache.org> wrote:
>>>> 
>>>> That seems like a smart solution Nick.
>>>> 
>>>> Adam
>>>> 
>>>>> On Nov 19, 2021, at 7:28 AM, Robert Newson <bo...@rsn.io> wrote:
>>>>> 
>>>>> Noting that the upgrade channel for views was misconceived (by me) as there is no version number in the header for them. You’d need to add it.
>>>>> 
>>>>> B.
>>>>> 
>>>>>> On 18 Nov 2021, at 07:12, Nick Vatamaniuc <va...@gmail.com> wrote:
>>>>>> 
>>>>>> Thinking more about this issue I wonder if we can avoid resetting and
>>>>>> rebuilding everything from scratch, and instead, let the upgrade
>>>>>> happen in the background, while still serving the existing view data.
>>>>>> 
>>>>>> The realization was that collation doesn't affect the emitted keys and
>>>>>> values themselves, only their order in the view b-trees. That means
>>>>>> we'd just have to rebuild b-trees, and that is exactly what our view
>>>>>> compactor already does.
>>>>>> 
>>>>>> When we detect a libicu version discrepancy we'd submit the view for
>>>>>> compaction. We even have a dedicated "upgrade" [1] channel in smoosh
>>>>>> which handles file version format upgrades, but we'll tweak that logic
>>>>>> to trigger on libicu version mismatches as well.
>>>>>> 
>>>>>> Would this work? Does anyone see any issue with that approach?
>>>>>> 
>>>>>> [1] https://github.com/apache/couchdb/blob/3.x/src/smoosh/src/smoosh_server.erl#L435-L442
>>>>>> 
>>>>>> Cheers,
>>>>>> -Nick
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Fri, Oct 29, 2021 at 7:01 PM Nick Vatamaniuc <va...@apache.org> wrote:
>>>>>>> 
>>>>>>> Hello everyone,
>>>>>>> 
>>>>>>> CouchDB by default uses the libicu library to sort its view rows.
>>>>>>> When views are built, we do not record or track the version of the
>>>>>>> collation algorithm. The issue is that the ICU library may modify the
>>>>>>> collation order between major libicu versions, and when that happens,
>>>>>>> views built with the older versions may experience data loss. I wanted
>>>>>>> to discuss the option to record the libicu collator version in each
>>>>>>> view then warn the user when there is a mismatch. Also, optionally
>>>>>>> ignore the mismatch, or automatically rebuild the views.
>>>>>>> 
>>>>>>> Imagine, for example, searching patient records using start/end keys.
>>>>>>> It could be possible that, say, the first letter of their name now
>>>>>>> collates differently in a new libicu. That would prevent the patient
>>>>>>> record from showing up in the view results for some important
>>>>>>> procedure or medication. Users might not even be aware of this kind of
>>>>>>> data loss occurring, there won't be any error in the API or warning in
>>>>>>> the logs.
>>>>>>> 
>>>>>>> I was thinking how to solve this. There were a few commits already to
>>>>>>> cleanup our collation drivers [1], expose libicu and collation
>>>>>>> algorithm version in the new _versions endpoint [2], and some other
>>>>>>> minor fixes in that area. As the next steps we could:
>>>>>>> 
>>>>>>> 1) Modify our views to keep track of the collation algorithm
>>>>>>> version. We could attempt to transparently upgrade the view header
>>>>>>> format -- read the old view file, update the header with an extra
>>>>>>> libicu collation version field, that updates the signature, and then,
>>>>>>> save the file with the new header and new signature. This avoids view
>>>>>>> rebuilds, just records the collator version in the view and moves the
>>>>>>> files to a new name.
>>>>>>> 
>>>>>>> 2) Do what PostgreSQL does, and 2a) emit a warning with the view
>>>>>>> results when the current libicu version doesn't match the version in
>>>>>>> the view [3]. That means altering the view results to add a "warning":
>>>>>>> "..." field. Another alternative 2b) is emit a warning in the
>>>>>>> _design/$ddoc/_info only. Users would have to know that after an OS
>>>>>>> version upgrade, or restoring backups, to make sure to look at their
>>>>>>> _design/$ddoc/_info for each db for each ddoc. Of course, there may be
>>>>>>> users which used the "raw" collation option, or know they are using
>>>>>>> just the plain ASCII character sets in their views. So we'd have a
>>>>>>> configuration setting to ignore the warnings as well.
>>>>>>> 
>>>>>>> 3) Users who see the warning, could then either rebuild the view
>>>>>>> with the new collator library manually, or it could happen
>>>>>>> automatically based on a configuration option, basically "when
>>>>>>> collator versions are mismatched, invalidate and rebuild all the
>>>>>>> views".
>>>>>>> 
>>>>>>> 4) We'd have a way for the users to assert (POST a ddoc update) that
>>>>>>> they double-checked the new ICU version and are convinced that a
>>>>>>> particular view would not experience data loss with the new collator.
>>>>>>> That should make the warning go away, and the view to not be rebuilt.
>>>>>>> This can't be just a naive "collator" option setting as both per-view
>>>>>>> and per-design options are used when computing the view signature, and
>>>>>>> any changes there would result in the view being rebuilt. Perhaps we
>>>>>>> can add it to the design docs as a separate option which is excluded
>>>>>>> from the signature hash, like the "autoupdate" setting for background
>>>>>>> index builder ("collation_version_accept"?). PostgreSQL also offers
>>>>>>> this option with the ALTER COLLATION ... REFRESH VERSION command [3]
>>>>>>> 
>>>>>>> What do we think, is this a reasonable approach? Is there something
>>>>>>> easier / simpler we can do?
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> -Nick
>>>>>>> 
>>>>>>> [1] https://github.com/apache/couchdb/pull/3746/commits/28f26f52fe2e170d98658311dafa8198d96b8061
>>>>>>> [2] https://github.com/apache/couchdb/commit/c1bb4e4856edd93255d75ebe158b4da38bbf3333
>>>>>>> [3] https://www.postgresql.org/docs/13/sql-altercollation.html
>>>>> 
>>>> 


Re: [DISCUSS] Handle libicu upgrades better

Posted by Nick Vatamaniuc <va...@gmail.com>.
Hi Will,

Thanks for taking a look.

We currently do not support any custom tailorings, so I was hoping
that using the UCA version would be enough to indicate an
incompatibility. It has the advantage of being human-readable, and it
matches the officially published UCA versions, so we can track changes
more easily. But perhaps a better option is ucol_getVersion() [1].
That version would capture the UCA version, the data tables, and any
tailorings if we add them in the future. The downside is that it's
opaque, so it would be hard to associate with a particular libicu
version. Postgres currently uses ucol_getVersion(), so I think we'll
switch to that as well.
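
For illustration only, here is a minimal sketch in Python of how an
opaque collator version could be recorded and compared. It assumes the
version is the four-part value that ucol_getVersion() fills in; the
helper names and the sample version numbers are made up and are not
taken from the PR. Because the value is opaque, the sketch only tests
for equality and never tries to order two versions.

```python
from typing import Optional, Tuple

# ucol_getVersion() fills a UVersionInfo, which is four small integers.
# It is modeled here as a 4-tuple; the numbers used in examples are invented.
CollatorVersion = Tuple[int, int, int, int]


def format_version(version: CollatorVersion) -> str:
    """Render the opaque version as a dotted string for logs or JSON."""
    return ".".join(str(part) for part in version)


def compare_versions(stored: Optional[CollatorVersion],
                     current: CollatorVersion) -> str:
    """Return "unknown", "match" or "mismatch".

    "unknown" covers view files written before any version was recorded.
    Only equality is checked: an opaque version has no meaningful order.
    """
    if stored is None:
        return "unknown"
    return "match" if stored == current else "mismatch"


if __name__ == "__main__":
    recorded = (153, 112, 40, 0)   # hypothetical value stored in a view
    running = (153, 120, 41, 0)    # hypothetical value from the running libicu
    print(format_version(recorded), format_version(running),
          compare_versions(recorded, running))
```

What to do with the "unknown" case (older view files) is deliberately
left open in this sketch.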

Good point about synchronizing ICU with SpiderMonkey. If both are
linked to a shared libicu library, we could query the libicu version
at runtime for info/debug purposes (with GET /_node/_local/_versions).
That will tell us the version used. Otherwise, if SpiderMonkey is
built with a statically linked libicu, it could have a completely
separate version. And in principle, I could see how a different
version of the JS engine, and the way it compares strings, could emit
different keys and values. That was primarily why I thought of adding a
metadata map item to the view header, so we could capture things like
that in the future. Presumably, in that case we'd have to completely
reset and rebuild the views from scratch.
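
The distinction here (a collation change only reorders rows, so
compaction is enough, while a JS engine change could alter the emitted
keys and values, so a full rebuild is needed) can be sketched as a small
decision function. This is a hypothetical Python sketch: the map keys
"collator_version" and "js_engine_version", the enum, and the function
name are assumptions for illustration and not the actual view header
format.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"        # nothing to do
    COMPACT = "compact"  # re-sort the b-trees, keep serving existing data
    REBUILD = "rebuild"  # emitted keys/values may differ, start over


def plan_action(view_meta: dict, runtime_meta: dict) -> Action:
    """Pick a maintenance action from a view's metadata map.

    A field missing from view_meta is treated as "nothing recorded, nothing
    to compare"; that is a simplification made for this sketch only.
    """
    stored_js = view_meta.get("js_engine_version")
    if stored_js is not None and stored_js != runtime_meta.get("js_engine_version"):
        # A different JS engine could emit different keys and values.
        return Action.REBUILD

    stored_coll = view_meta.get("collator_version")
    if stored_coll is not None and stored_coll != runtime_meta.get("collator_version"):
        # Collation affects only the row order, not the rows themselves.
        return Action.COMPACT

    return Action.NONE


if __name__ == "__main__":
    view = {"collator_version": "153.112.40.0"}
    node = {"collator_version": "153.120.41.0"}
    print(plan_action(view, node))
```

Checking the JS engine field first means the more drastic action wins
when both fields differ.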

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#a0f98dd01ba7a64069ade6f0fda13528d

Cheers,
-Nick

On Wed, Jan 12, 2022 at 10:39 AM Will Young <lo...@gmail.com> wrote:
>
> Hi Nick,
>
>   I like the way this breaks down the problem into something that can
> work with the existing maintenance mechanisms. On the UCA version it
> looks to me like the major version tracks the last unicode version
> that had a collation change (version 9.0?), while the ICU version is
> changing with each release which would be more frequent than actual
> collation changes. Looking at the ICU release notes I get the
> impression that the frequency of change may be in between because of bug
> fixes or additions to unicode that directly get a differing order in
> the root collation. I.e. ICU 54 seems like a clean match of UCA
> version and collation change while it seems like 59 could have changed
> some emoji sort orders that may already have been reflected in 58's
> UCA version?
>
> Another question I have about ICU synchronization is spidermonkey's
> use of ICU. Since all build instructions keep erlang and mozjs'
> linking to the same system ICU, I think there could never be a need to
> record an ICU related version from the query server, but I've never
> seen instructions to set locales in relation to the query server or do
> anything to ensure a function is using the root collator, so I don't
> think the build setup reflects an actual need for spidermonkey to be
> truly in sync on aspects of icu like collation setup and everything
> important is happening in the erlang/nifs?
>  Thanks,
> -Will
>
> On Tue, Jan 11, 2022 at 19:06, Nick Vatamaniuc <va...@gmail.com> wrote:
> >
> > I threw together a draft PR https://github.com/apache/couchdb/pull/3889
> >
> > Would that work? There are two tricks there - re-using a field
> > position from an older <2.3.1 format, this should allow transparently
> > downgrading back to 3.2.1 as we ignore that field there. Also, used a
> > map  so it should allow adding extra info to the views in the future
> > (custom collation tailorings?).
> >
> > Thanks,
> > -Nick
> >
> > On Sat, Nov 20, 2021 at 12:32 PM Nick Vatamaniuc <va...@gmail.com> wrote:
> > >
> > > Thanks, Adam. And thanks for the tip about the view header, Bob.
> > >
> > > Wonder if a disk version would make sense for views. Noticed Eric did
> > > a nice job transparently migrating 2.x -> 3.x view files when we
> > > removed key seq indices. Perhaps something like that would work for
> > > adding a collator version.
> > >
> > > Cheers,
> > > -Nick
> > >
> > > On Fri, Nov 19, 2021 at 9:09 AM Adam Kocoloski <ko...@apache.org> wrote:
> > > >
> > > > That seems like a smart solution Nick.
> > > >
> > > > Adam
> > > >
> > > > > On Nov 19, 2021, at 7:28 AM, Robert Newson <bo...@rsn.io> wrote:
> > > > >
> > > > > Noting that the upgrade channel for views was misconceived (by me) as there is no version number in the header for them. You’d need to add it.
> > > > >
> > > > > B.
> > > > >
> > > > >> On 18 Nov 2021, at 07:12, Nick Vatamaniuc <va...@gmail.com> wrote:
> > > > >>
> > > > >> Thinking more about this issue I wonder if we can avoid resetting and
> > > > >> rebuilding everything from scratch, and instead, let the upgrade
> > > > >> happen in the background, while still serving the existing view data.
> > > > >>
> > > > >> The realization was that collation doesn't affect the emitted keys and
> > > > >> values themselves, only their order in the view b-trees. That means
> > > > >> we'd just have to rebuild b-trees, and that is exactly what our view
> > > > >> compactor already does.
> > > > >>
> > > > >> When we detect a libicu version discrepancy we'd submit the view for
> > > > >> compaction. We even have a dedicated "upgrade" [1] channel in smoosh
> > > > >> which handles file version format upgrades, but we'll tweak that logic
> > > > >> to trigger on libicu version mismatches as well.
> > > > >>
> > > > >> Would this work? Does anyone see any issue with that approach?
> > > > >>
> > > > >> [1] https://github.com/apache/couchdb/blob/3.x/src/smoosh/src/smoosh_server.erl#L435-L442
> > > > >>
> > > > >> Cheers,
> > > > >> -Nick
> > > > >>
> > > > >>
> > > > >>
> > > > >>> On Fri, Oct 29, 2021 at 7:01 PM Nick Vatamaniuc <va...@apache.org> wrote:
> > > > >>>
> > > > >>> Hello everyone,
> > > > >>>
> > > > >>> CouchDB by default uses the libicu library to sort its view rows.
> > > > >>> When views are built, we do not record or track the version of the
> > > > >>> collation algorithm. The issue is that the ICU library may modify the
> > > > >>> collation order between major libicu versions, and when that happens,
> > > > >>> views built with the older versions may experience data loss. I wanted
> > > > >>> to discuss the option to record the libicu collator version in each
> > > > >>> view then warn the user when there is a mismatch. Also, optionally
> > > > >>> ignore the mismatch, or automatically rebuild the views.
> > > > >>>
> > > > >>> Imagine, for example, searching patient records using start/end keys.
> > > > >>> It could be possible that, say, the first letter of their name now
> > > > >>> collates differently in a new libicu. That would prevent the patient
> > > > >>> record from showing up in the view results for some important
> > > > >>> procedure or medication. Users might not even be aware of this kind of
> > > > >>> data loss occurring, there won't be any error in the API or warning in
> > > > >>> the logs.
> > > > >>>
> > > > >>> I was thinking how to solve this. There were a few commits already to
> > > > >>> cleanup our collation drivers [1], expose libicu and collation
> > > > >>> algorithm version in the new _versions endpoint [2], and some other
> > > > >>> minor fixes in that area. As the next steps we could:
> > > > >>>
> > > > >>> 1) Modify our views to keep track of the collation algorithm
> > > > >>> version. We could attempt to transparently upgrade the view header
> > > > >>> format -- read the old view file, update the header with an extra
> > > > >>> libicu collation version field, that updates the signature, and then,
> > > > >>> save the file with the new header and new signature. This avoids view
> > > > >>> rebuilds, just records the collator version in the view and moves the
> > > > >>> files to a new name.
> > > > >>>
> > > > >>> 2) Do what PostgreSQL does, and 2a) emit a warning with the view
> > > > >>> results when the current libicu version doesn't match the version in
> > > > >>> the view [3]. That means altering the view results to add a "warning":
> > > > >>> "..." field. Another alternative 2b) is emit a warning in the
> > > > >>> _design/$ddoc/_info only. Users would have to know that after an OS
> > > > >>> version upgrade, or restoring backups, to make sure to look at their
> > > > >>> _design/$ddoc/_info for each db for each ddoc. Of course, there may be
> > > > >>> users which used the "raw" collation option, or know they are using
> > > > >>> just the plain ASCII character sets in their views. So we'd have a
> > > > >>> configuration setting to ignore the warnings as well.
> > > > >>>
> > > > >>> 3) Users who see the warning, could then either rebuild the view
> > > > >>> with the new collator library manually, or it could happen
> > > > >>> automatically based on a configuration option, basically "when
> > > > >>> collator versions are mismatched, invalidate and rebuild all the
> > > > >>> views".
> > > > >>>
> > > > >>> 4) We'd have a way for the users to assert (POST a ddoc update) that
> > > > >>> they double-checked the new ICU version and are convinced that a
> > > > >>> particular view would not experience data loss with the new collator.
> > > > >>> That should make the warning go away, and the view to not be rebuilt.
> > > > >>> This can't be just a naive "collator" option setting as both per-view
> > > > >>> and per-design options are used when computing the view signature, and
> > > > >>> any changes there would result in the view being rebuilt. Perhaps we
> > > > >>> can add it to the design docs as a separate option which is excluded
> > > > >>> from the signature hash, like the "autoupdate" setting for background
> > > > >>> index builder ("collation_version_accept"?). PostgreSQL also offers
> > > > >>> this option with the ALTER COLLATION ... REFRESH VERSION command [3]
> > > > >>>
> > > > >>> What do we think, is this a reasonable approach? Is there something
> > > > >>> easier / simpler we can do?
> > > > >>>
> > > > >>> Thanks!
> > > > >>> -Nick
> > > > >>>
> > > > >>> [1] https://github.com/apache/couchdb/pull/3746/commits/28f26f52fe2e170d98658311dafa8198d96b8061
> > > > >>> [2] https://github.com/apache/couchdb/commit/c1bb4e4856edd93255d75ebe158b4da38bbf3333
> > > > >>> [3] https://www.postgresql.org/docs/13/sql-altercollation.html
> > > > >
> > > >

Re: [DISCUSS] Handle libicu upgrades better

Posted by Will Young <lo...@gmail.com>.
Hi Nick,

  I like the way this breaks down the problem into something that can
work with the existing maintenance mechanisms. On the UCA version, it
looks to me like the major version tracks the last Unicode version
that had a collation change (version 9.0?), while the ICU version
changes with each release, which would be more frequent than actual
collation changes. Looking at the ICU release notes, I get the
impression that the real frequency of change may be somewhere in
between, because of bug fixes or additions to Unicode that directly get
a differing order in the root collation. For example, ICU 54 seems like
a clean match of UCA version and collation change, while it seems like
59 could have changed some emoji sort orders that may already have been
reflected in 58's UCA version?
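
To make it concrete why even a small change in the root collation
matters for an already built view, here is a toy sketch in Python. The
two "collations" are invented orderings of a five-letter alphabet, not
real ICU data, and the sorted list stands in for a view b-tree; all
names are hypothetical.

```python
# Two made-up collations over a tiny alphabet. The only difference is
# where "é" sorts: last in the old order, right after "b" in the new one.
OLD_ORDER = "abcdé"
NEW_ORDER = "abécd"


def sort_key(text, order):
    """Map each character to its rank in the given collation order."""
    return [order.index(ch) for ch in text]


def build_view(keys, order):
    """Stand-in for building a view: keys stored sorted by collation."""
    return sorted(keys, key=lambda k: sort_key(k, order))


def lookup(view, key, order):
    """Binary search that trusts `view` to be sorted under `order`."""
    target = sort_key(key, order)
    lo, hi = 0, len(view)
    while lo < hi:
        mid = (lo + hi) // 2
        if sort_key(view[mid], order) < target:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(view) and view[lo] == key


if __name__ == "__main__":
    keys = ["ab", "b", "cd", "d", "éa"]
    view = build_view(keys, OLD_ORDER)

    print(lookup(view, "éa", OLD_ORDER))   # same collator as the build
    print(lookup(view, "éa", NEW_ORDER))   # new collator, old row order
    print(lookup(build_view(keys, NEW_ORDER), "éa", NEW_ORDER))  # re-sorted
```

The row is still stored in the second case; the search simply walks
away from it, and no error is raised. Re-sorting the same rows under the
new order makes it reachable again.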

Another question I have about ICU synchronization is SpiderMonkey's
use of ICU. Since all build instructions keep Erlang and mozjs linked
to the same system ICU, I think there could never be a need to record
an ICU-related version from the query server. But I've never seen
instructions to set locales in relation to the query server, or to do
anything to ensure a function is using the root collator, so I don't
think the build setup reflects an actual need for SpiderMonkey to be
truly in sync on aspects of ICU like collation setup, and everything
important is happening in the Erlang NIFs?
 Thanks,
-Will

On Tue, Jan 11, 2022 at 19:06, Nick Vatamaniuc <va...@gmail.com> wrote:
>
> I threw together a draft PR https://github.com/apache/couchdb/pull/3889
>
> Would that work? There are two tricks there - re-using a field
> position from an older <2.3.1 format, this should allow transparently
> downgrading back to 3.2.1 as we ignore that field there. Also, used a
> map  so it should allow adding extra info to the views in the future
> (custom collation tailorings?).
>
> Thanks,
> -Nick
>
> On Sat, Nov 20, 2021 at 12:32 PM Nick Vatamaniuc <va...@gmail.com> wrote:
> >
> > Thanks, Adam. And thanks for the tip about the view header, Bob.
> >
> > Wonder if a disk version would make sense for views. Noticed Eric did
> > a nice job transparently migrating 2.x -> 3.x view files when we
> > removed key seq indices. Perhaps something like that would work for
> > adding a collator version.
> >
> > Cheers,
> > -Nick
> >
> > On Fri, Nov 19, 2021 at 9:09 AM Adam Kocoloski <ko...@apache.org> wrote:
> > >
> > > That seems like a smart solution Nick.
> > >
> > > Adam
> > >
> > > > On Nov 19, 2021, at 7:28 AM, Robert Newson <bo...@rsn.io> wrote:
> > > >
> > > > Noting that the upgrade channel for views was misconceived (by me) as there is no version number in the header for them. You’d need to add it.
> > > >
> > > > B.
> > > >
> > > >> On 18 Nov 2021, at 07:12, Nick Vatamaniuc <va...@gmail.com> wrote:
> > > >>
> > > >> Thinking more about this issue I wonder if we can avoid resetting and
> > > >> rebuilding everything from scratch, and instead, let the upgrade
> > > >> happen in the background, while still serving the existing view data.
> > > >>
> > > >> The realization was that collation doesn't affect the emitted keys and
> > > >> values themselves, only their order in the view b-trees. That means
> > > >> we'd just have to rebuild b-trees, and that is exactly what our view
> > > >> compactor already does.
> > > >>
> > > >> When we detect a libicu version discrepancy we'd submit the view for
> > > >> compaction. We even have a dedicated "upgrade" [1] channel in smoosh
> > > >> which handles file version format upgrades, but we'll tweak that logic
> > > >> to trigger on libicu version mismatches as well.
> > > >>
> > > >> Would this work? Does anyone see any issue with that approach?
> > > >>
> > > >> [1] https://github.com/apache/couchdb/blob/3.x/src/smoosh/src/smoosh_server.erl#L435-L442
> > > >>
> > > >> Cheers,
> > > >> -Nick
> > > >>
> > > >>
> > > >>
> > > >>> On Fri, Oct 29, 2021 at 7:01 PM Nick Vatamaniuc <va...@apache.org> wrote:
> > > >>>
> > > >>> Hello everyone,
> > > >>>
> > > >>> CouchDB by default uses the libicu library to sort its view rows.
> > > >>> When views are built, we do not record or track the version of the
> > > >>> collation algorithm. The issue is that the ICU library may modify the
> > > >>> collation order between major libicu versions, and when that happens,
> > > >>> views built with the older versions may experience data loss. I wanted
> > > >>> to discuss the option to record the libicu collator version in each
> > > >>> view then warn the user when there is a mismatch. Also, optionally
> > > >>> ignore the mismatch, or automatically rebuild the views.
> > > >>>
> > > >>> Imagine, for example, searching patient records using start/end keys.
> > > >>> It could be possible that, say, the first letter of their name now
> > > >>> collates differently in a new libicu. That would prevent the patient
> > > >>> record from showing up in the view results for some important
> > > >>> procedure or medication. Users might not even be aware of this kind of
> > > >>> data loss occurring, there won't be any error in the API or warning in
> > > >>> the logs.
> > > >>>
> > > >>> I was thinking how to solve this. There were a few commits already to
> > > >>> cleanup our collation drivers [1], expose libicu and collation
> > > >>> algorithm version in the new _versions endpoint [2], and some other
> > > >>> minor fixes in that area. As the next steps we could:
> > > >>>
> > > >>> 1) Modify our views to keep track of the collation algorithm
> > > >>> version. We could attempt to transparently upgrade the view header
> > > >>> format -- read the old view file, update the header with an extra
> > > >>> libicu collation version field, that updates the signature, and then,
> > > >>> save the file with the new header and new signature. This avoids view
> > > >>> rebuilds, just records the collator version in the view and moves the
> > > >>> files to a new name.
> > > >>>
> > > >>> 2) Do what PostgreSQL does, and 2a) emit a warning with the view
> > > >>> results when the current libicu version doesn't match the version in
> > > >>> the view [3]. That means altering the view results to add a "warning":
> > > >>> "..." field. Another alternative, 2b), is to emit a warning in the
> > > >>> _design/$ddoc/_info only. Users would have to know to look at their
> > > >>> _design/$ddoc/_info for each db and each ddoc after an OS version
> > > >>> upgrade or after restoring backups. Of course, there may be
> > > >>> users who used the "raw" collation option, or know they are using
> > > >>> just the plain ASCII character sets in their views. So we'd have a
> > > >>> configuration setting to ignore the warnings as well.
> > > >>>
> > > >>> 3) Users who see the warning could then either rebuild the view
> > > >>> with the new collator library manually, or it could happen
> > > >>> automatically based on a configuration option, basically "when
> > > >>> collator versions are mismatched, invalidate and rebuild all the
> > > >>> views".
> > > >>>
> > > >>> 4) We'd have a way for the users to assert (POST a ddoc update) that
> > > >>> they double-checked the new ICU version and are convinced that a
> > > >>> particular view would not experience data loss with the new collator.
> > > >>> That should make the warning go away, and the view to not be rebuilt.
> > > >>> This can't be just a naive "collator" option setting as both per-view
> > > >>> and per-design options are used when computing the view signature, and
> > > >>> any changes there would result in the view being rebuilt. Perhaps we
> > > >>> can add it to the design docs as a separate option which is excluded
> > > >>> from the signature hash, like the "autoupdate" setting for background
> > > >>> index builder ("collation_version_accept"?). PostgreSQL also offers
> > > >>> this option with the ALTER COLLATION ... REFRESH VERSION command [3]
> > > >>>
> > > >>> What do we think, is this a reasonable approach? Is there something
> > > >>> easier / simpler we can do?
> > > >>>
> > > >>> Thanks!
> > > >>> -Nick
> > > >>>
> > > >>> [1] https://github.com/apache/couchdb/pull/3746/commits/28f26f52fe2e170d98658311dafa8198d96b8061
> > > >>> [2] https://github.com/apache/couchdb/commit/c1bb4e4856edd93255d75ebe158b4da38bbf3333
> > > >>> [3] https://www.postgresql.org/docs/13/sql-altercollation.html
> > > >
> > >

Re: [DISCUSS] Handle libicu upgrades better

Posted by Nick Vatamaniuc <va...@gmail.com>.
Hi everyone,

The pull request is open and ready for review:
https://github.com/apache/couchdb/pull/3889

In summary, I think we managed to track the collation versions in
views in a backwards compatible manner. There is a new view info map
in the header which can be used in the future to record additional
metadata about views (couchjs versions?, etc). With a slight bit of
trickery, the views can also be transparently downgraded to the 3.2.1
version without a signature change or a view reset.
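
To make the header change a bit more concrete, here is a rough Python
sketch of the idea. The real code is Erlang; the names ViewHeader,
view_info and upgrade_header are made up for illustration and are not
the identifiers used in the PR. It models a header slot that older
releases ignore, treats anything that is not a map as "nothing
recorded", and adds the current collator version while leaving the
signature alone:

```python
from dataclasses import dataclass, replace
from typing import Any


@dataclass(frozen=True)
class ViewHeader:
    """Simplified stand-in for a view file header (illustrative only)."""
    signature: str
    # Slot that older releases leave unused; newer code keeps a map here.
    view_info: Any = None


def upgrade_header(header: ViewHeader, current_collator: str) -> ViewHeader:
    """Return a header whose view_info map lists the current collator version."""
    # Anything that is not already a map is treated as "no info recorded".
    info = dict(header.view_info) if isinstance(header.view_info, dict) else {}
    versions = list(info.get("collator_versions", []))
    if current_collator not in versions:
        versions.append(current_collator)
    info["collator_versions"] = versions
    # The signature is carried over unchanged, so the file keeps its name.
    return replace(header, view_info=info)


old = ViewHeader(signature="abc123")
new = upgrade_header(old, "153.112")  # the version string is made up
print(new.view_info)
```

Because the extra data lives in a map, more keys could be added later
without another header format change.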

Views which have more than one collator version will be submitted for
re-compaction to the already existing smoosh upgrade views channel.
The automatic trigger can be disabled as well with a new config
option.
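
The trigger rule itself is simple. A minimal sketch in Python, where
auto_trigger stands in for the new config option (the parameter name is
hypothetical, and the real check lives in the Erlang code):

```python
def needs_recompaction(collator_versions, auto_trigger=True):
    """Decide whether a view should be queued for re-compaction.

    collator_versions: versions recorded in the view header.
    auto_trigger: stand-in for the new config option (hypothetical name).
    """
    if not auto_trigger:
        return False
    # More than one distinct version means parts of the b-tree may have
    # been ordered by different collators.
    return len(set(collator_versions)) > 1


print(needs_recompaction(["153.104", "153.112"]))
```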

As for HTTP API differences, two endpoints now emit additional
collator version info:

  _design/*/_info: the "view_index" object has a "collator_versions"
    list with the collator versions recorded for that particular view
  _node/*/_versions: the "collation_driver" object has a new
    "collator_version" field with the current collator version

An example of this can be seen in
https://github.com/apache/couchdb/pull/3889#issuecomment-1022643789
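
As a rough illustration of how a client could use the two fields
together, here is a small Python sketch. It only relies on the field
names mentioned above; the sample responses are trimmed down and the
version strings are made up, and the function name is just for this
example:

```python
import json


def view_collators_match(design_info: dict, node_versions: dict) -> bool:
    """True when the view only contains the node's current collator version."""
    view_versions = design_info["view_index"]["collator_versions"]
    current = node_versions["collation_driver"]["collator_version"]
    return set(view_versions) == {current}


# Trimmed-down sample responses; the version strings are made up.
design_info = json.loads(
    '{"view_index": {"collator_versions": ["153.104", "153.112"]}}'
)
node_versions = json.loads(
    '{"collation_driver": {"collator_version": "153.112"}}'
)

print(view_collators_match(design_info, node_versions))  # False
```

A False result here would mean the view still holds rows ordered by an
older collator and is a candidate for re-compaction.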

Thanks to everyone who participated in the discussion! Let's continue
the review in the PR comments.

Cheers,
-Nick

On Tue, Jan 25, 2022 at 10:15 AM Nick Vatamaniuc <va...@gmail.com> wrote:
>
> Good idea, Will, to return the current collator version in the
> `/_node/_local/_versions` output. We already return the collation
> algorithm and library versions; however, since we switched to tracking
> the opaque collator version, it's good to expose that too.

Re: [DISCUSS] Handle libicu upgrades better

Posted by Nick Vatamaniuc <va...@gmail.com>.
Good idea, Will, to return the current collator version in the
`/_node/_local/_versions` output. We already return the collation
algorithm and library versions; however, since we switched to tracking
the opaque collator version, it's good to expose that too.
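
For anyone wondering what "opaque" means here: ICU's ucol_getVersion
fills in a UVersionInfo, which is just four bytes with no documented
meaning beyond "changes when the collator behavior changes". A minimal
Python sketch of turning such a value into a dotted string follows. How
CouchDB actually renders the value is defined in the PR, so treat the
function name and the sample bytes as assumptions:

```python
def format_collator_version(raw: bytes) -> str:
    """Render a four-byte ICU-style version as a dotted string."""
    if len(raw) != 4:
        raise ValueError("expected exactly 4 version bytes")
    return ".".join(str(b) for b in raw)


# Sample bytes are made up for the example.
print(format_collator_version(bytes([153, 112, 40, 0])))  # 153.112.40.0
```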

On Tue, Jan 25, 2022 at 9:47 AM Will Young <lo...@gmail.com> wrote:
>
> Hi Nick,
>
>   The view _info setup looks good to me. Maybe it would be helpful to
> print the current runtime's collator and ICU versions somewhere like
> the / meta or /_node/*/_system endpoint? I think that would provide a
> way to cross-reference them, which would alleviate the drawback of the
> collator version being the least human-readable one (though only to
> find a more readable version for the views that are from the current
> runtime), and maybe to debug oddities like a cluster somehow having
> nodes that are out of sync on libicu versions, or just to make it
> easier to check if dbs are going to be rebuilt after an update. Of
> course, there are also other ways for an admin to examine the current
> runtime and work out versions, so it is probably a question of how
> frequently it will come up.
>
> Thanks,
> Will

Re: [DISCUSS] Handle libicu upgrades better

Posted by Will Young <lo...@gmail.com>.
Hi Nick,

  The view _info setup looks good to me. Maybe it would be helpful to
print the current runtime's collator and ICU versions somewhere like
the / meta or /_node/*/_system endpoint? I think that would provide a
way to cross-reference them, which would alleviate the drawback of the
collator version being the least human-readable one (though only to
find a more readable version for the views that are from the current
runtime), and maybe to debug oddities like a cluster somehow having
nodes that are out of sync on libicu versions, or just to make it
easier to check if dbs are going to be rebuilt after an update. Of
course, there are also other ways for an admin to examine the current
runtime and work out versions, so it is probably a question of how
frequently it will come up.
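
As an example of the out-of-sync case, something like the following
Python sketch could flag it. It assumes each node reports a
"collation_driver" object with a "collator_version" field, as the
_versions endpoint does after the PR; the node names and version
strings are made up, and fetching the responses over HTTP is left out:

```python
from collections import defaultdict


def group_nodes_by_collator(versions_by_node: dict) -> dict:
    """Map each reported collator version to the nodes that report it."""
    groups = defaultdict(list)
    for node, versions in versions_by_node.items():
        key = versions["collation_driver"]["collator_version"]
        groups[key].append(node)
    return dict(groups)


# Made-up per-node responses, already parsed from JSON.
cluster = {
    "couchdb@node1": {"collation_driver": {"collator_version": "153.112"}},
    "couchdb@node2": {"collation_driver": {"collator_version": "153.112"}},
    "couchdb@node3": {"collation_driver": {"collator_version": "153.104"}},
}

groups = group_nodes_by_collator(cluster)
if len(groups) > 1:
    print("nodes disagree on collator version:", groups)
```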

Thanks,
Will

On Tue, Jan 25, 2022 at 7:45 AM Nick Vatamaniuc <va...@gmail.com> wrote:
>
> Another update regarding the draft PR.
>
> There are now upgrade tests to see how we handle older 2.x, 3.2.1, and
> views with multiple collator versions in them.
>
> The last commit modifies the _design/*/_info API to return the list of
> collator versions to the user. I wanted to see what everyone thought
> about it:
>
> https://github.com/apache/couchdb/pull/3889#issuecomment-1020861208
>
> Thanks,
> -Nick

Re: [DISCUSS] Handle libicu upgrades better

Posted by Nick Vatamaniuc <va...@gmail.com>.
Another update regarding the draft PR.

There are now upgrade tests to see how we handle older 2.x, 3.2.1, and
views with multiple collator versions in them.

The last commit modifies the _design/*/_info API to return the list of
collator versions to the user. I wanted to see what everyone thought
about it:

https://github.com/apache/couchdb/pull/3889#issuecomment-1020861208

Thanks,
-Nick

On Tue, Jan 11, 2022 at 1:06 PM Nick Vatamaniuc <va...@gmail.com> wrote:
>
> I threw together a draft PR https://github.com/apache/couchdb/pull/3889
>
> Would that work? There are two tricks there. First, it re-uses a field
> position from an older <2.3.1 format, which should allow transparently
> downgrading back to 3.2.1, as we ignore that field there. Second, it
> uses a map, so it should allow adding extra info to the views in the
> future (custom collation tailorings?).
>
> Thanks,
> -Nick
>
> On Sat, Nov 20, 2021 at 12:32 PM Nick Vatamaniuc <va...@gmail.com> wrote:
> >
> > Thanks, Adam. And thanks for the tip about the view header, Bob.
> >
> > Wonder if a disk version would make sense for views. Noticed Eric did
> > a nice job transparently migrating 2.x -> 3.x view files when we
> > removed key seq indices. Perhaps something like that would work for
> > adding a collator version.
> >
> > Cheers,
> > -Nick
> >
> > On Fri, Nov 19, 2021 at 9:09 AM Adam Kocoloski <ko...@apache.org> wrote:
> > >
> > > That seems like a smart solution Nick.
> > >
> > > Adam
> > >
> > > > On Nov 19, 2021, at 7:28 AM, Robert Newson <bo...@rsn.io> wrote:
> > > >
> > > > Noting that the upgrade channel for views was misconceived (by me) as there is no version number in the header for them. You’d need to add it.
> > > >
> > > > B.
> > > >
> > > >> On 18 Nov 2021, at 07:12, Nick Vatamaniuc <va...@gmail.com> wrote:
> > > >>
> > > >> Thinking more about this issue I wonder if we can avoid resetting and
> > > >> rebuilding everything from scratch, and instead, let the upgrade
> > > >> happen in the background, while still serving the existing view data.
> > > >>
> > > >> The realization was that collation doesn't affect the emitted keys and
> > > >> values themselves, only their order in the view b-trees. That means
> > > >> we'd just have to rebuild b-trees, and that is exactly what our view
> > > >> compactor already does.
> > > >>
> > > >> When we detect a libicu version discrepancy we'd submit the view for
> > > >> compaction. We even have a dedicated "upgrade" [1] channel in smoosh
> > > >> which handles file version format upgrades, but we'll tweak that logic
> > > >> to trigger on libicu version mismatches as well.
> > > >>
> > > >> Would this work? Does anyone see any issue with that approach?
> > > >>
> > > >> [1] https://github.com/apache/couchdb/blob/3.x/src/smoosh/src/smoosh_server.erl#L435-L442
> > > >>
> > > >> Cheers,
> > > >> -Nick
> > > >>
> > > >>
> > > >>
> > > >>> On Fri, Oct 29, 2021 at 7:01 PM Nick Vatamaniuc <va...@apache.org> wrote:
> > > >>>
> > > >>> Hello everyone,
> > > >>>
> > > >>> CouchDB by default uses the libicu library to sort its view rows.
> > > >>> When views are built, we do not record or track the version of the
> > > >>> collation algorithm. The issue is that the ICU library may modify the
> > > >>> collation order between major libicu versions, and when that happens,
> > > >>> views built with the older versions may experience data loss. I wanted
> > > >>> to discuss the option to record the libicu collator version in each
> > > >>> view then warn the user when there is a mismatch. Also, optionally
> > > >>> ignore the mismatch, or automatically rebuild the views.
> > > >>>
> > > >>> Imagine, for example, searching patient records using start/end keys.
> > > >>> It could be possible that, say, the first letter of their name now
> > > >>> collates differently in a new libicu. That would prevent the patient
> > > >>> record from showing up in the view results for some important
> > > >>> procedure or medication. Users might not even be aware of this kind of
> > > >>> data loss occurring, there won't be any error in the API or warning in
> > > >>> the logs.
> > > >>>
> > > >>> I was thinking how to solve this. There were a few commits already to
> > > >>> cleanup our collation drivers [1], expose libicu and collation
> > > >>> algorithm version in the new _versions endpoint [2], and some other
> > > >>> minor fixes in that area. As the next steps we could:
> > > >>>
> > > >>> 1) Modify our views to keep track of the collation algorithm
> > > >>> version. We could attempt to transparently upgrade the view header
> > > >>> format -- read the old view file, update the header with an extra
> > > >>> libicu collation version field, that updates the signature, and then,
> > > >>> save the file with the new header and new signature. This avoids view
> > > >>> rebuilds, just records the collator version in the view and moves the
> > > >>> files to a new name.
> > > >>>
> > > >>> 2) Do what PostgreSQL does, and 2a) emit a warning with the view
> > > >>> results when the current libicu version doesn't match the version in
> > > >>> the view [3]. That means altering the view results to add a "warning":
> > > >>> "..." field. Another alternative 2b) is emit a warning in the
> > > >>> _design/$ddoc/_info only. Users would have to know that after an OS
> > > >>> version upgrade, or restoring backups, to make sure to look at their
> > > >>> _design/$ddoc/_info for each db for each ddoc. Of course, there may be
> > > >>> users which used the "raw" collation option, or know they are using
> > > >>> just the plain ASCII character sets in their views. So we'd have a
> > > >>> configuration setting to ignore the warnings as well.
> > > >>>
> > > >>> 3) Users who see the warning, could then either rebuild the view
> > > >>> with the new collator library manually, or it could happen
> > > >>> automatically based on a configuration option, basically "when
> > > >>> collator versions are miss-matched, invalidate and rebuild all the
> > > >>> views".
> > > >>>
> > > >>> 4) We'd have a way for the users to assert (POST a ddoc update) that
> > > >>> they double-checked the new ICU version and are convinced that a
> > > >>> particular view would not experience data loss with the new collator.
> > > >>> That should make the warning go away, and the view to not be rebuilt.
> > > >>> This can't be just a naive "collator" option setting as both per-view
> > > >>> and per-design options are used when computing the view signature, and
> > > >>> any changes there would result in the view being rebuilt. Perhaps we
> > > >>> can add it to the design docs as a separate option which is excluded
> > > >>> from the signature hash, like the "autoupdate" setting for background
> > > >>> index builder ("collation_version_accept"?). PostgreSQL also offers
> > > >>> this option with the ALTER COLLATION ... REFRESH VERSION command [3]
> > > >>>
> > > >>> What do we think, is this a reasonable approach? Is there something
> > > >>> easier / simpler we can do?
> > > >>>
> > > >>> Thanks!
> > > >>> -Nick
> > > >>>
> > > >>> [1] https://github.com/apache/couchdb/pull/3746/commits/28f26f52fe2e170d98658311dafa8198d96b8061
> > > >>> [2] https://github.com/apache/couchdb/commit/c1bb4e4856edd93255d75ebe158b4da38bbf3333
> > > >>> [3] https://www.postgresql.org/docs/13/sql-altercollation.html
> > > >
> > >

Re: [DISCUSS] Handle libicu upgrades better

Posted by Nick Vatamaniuc <va...@gmail.com>.
I threw together a draft PR https://github.com/apache/couchdb/pull/3889

Would that work? There are two tricks there. The first is re-using a
field position from an older <2.3.1 format, which should allow
transparently downgrading back to 3.2.1, as we ignore that field there.
The second is using a map, so it should allow adding extra info to the
views in the future (custom collation tailorings?).
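The two tricks can be sketched in Python as follows. This is only an illustration of the idea, not the actual view header: the three-field tuple layout, the function names, and the "collator_versions" key are all hypothetical.

```python
# Hypothetical header layout: (signature, legacy_slot, update_seq).
# The real view header is different; this only illustrates the idea.

def read_header_old(header):
    """Older reader: unpacks the legacy slot but never looks at it."""
    signature, _ignored, update_seq = header
    return {"signature": signature, "update_seq": update_seq}


def write_header_new(signature, update_seq, collator_versions):
    """Newer writer: stores a map in the slot the older reader ignores."""
    info = {"collator_versions": list(collator_versions)}
    return (signature, info, update_seq)


def read_header_new(header):
    """Newer reader: accepts both old files and files with the map."""
    signature, slot, update_seq = header
    # Files written before the change hold something other than a map
    # here; treat that as "no collator versions recorded".
    info = slot if isinstance(slot, dict) else {}
    return {
        "signature": signature,
        "update_seq": update_seq,
        "collator_versions": list(info.get("collator_versions", [])),
    }
```

Because the older reader discards that slot, a header written by the newer code still reads cleanly there, and further keys could later be added to the map without changing the layout again.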

Thanks,
-Nick

Re: [DISCUSS] Handle libicu upgrades better

Posted by Nick Vatamaniuc <va...@gmail.com>.
Thanks, Adam. And thanks for the tip about the view header, Bob.

I wonder whether a disk version would make sense for views. I noticed
Eric did a nice job transparently migrating 2.x -> 3.x view files when
we removed key seq indices. Perhaps something like that would work for
adding a collator version.

Cheers,
-Nick

Re: [DISCUSS] Handle libicu upgrades better

Posted by Adam Kocoloski <ko...@apache.org>.
That seems like a smart solution, Nick.

Adam

Re: [DISCUSS] Handle libicu upgrades better

Posted by Robert Newson <bo...@rsn.io>.
Noting that the upgrade channel for views was misconceived (by me) as there is no version number in the header for them. You’d need to add it. 

B. 

Re: [DISCUSS] Handle libicu upgrades better

Posted by Nick Vatamaniuc <va...@gmail.com>.
Thinking more about this issue, I wonder if we can avoid resetting and
rebuilding everything from scratch and instead let the upgrade
happen in the background, while still serving the existing view data.

The realization was that collation doesn't affect the emitted keys and
values themselves, only their order in the view b-trees. That means
we'd just have to rebuild b-trees, and that is exactly what our view
compactor already does.
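A small Python sketch of that realization, not CouchDB code: the two key functions are made-up stand-ins for an old and a new collator, and a sorted list stands in for the b-tree. The rows never change, yet a lookup that assumes the new order can miss a row that is still stored in the old order. Re-sorting the same rows fixes it.

```python
import bisect

# Made-up stand-ins for two collator versions: the "old" one orders by
# raw code point, the "new" one compares case-insensitively first.
def old_key(s):
    return s


def new_key(s):
    return (s.lower(), s)


def lookup(index, key, key_fn):
    """Binary-search an index that is assumed to be sorted by key_fn."""
    keys = [key_fn(k) for k, _ in index]
    i = bisect.bisect_left(keys, key_fn(key))
    if i < len(index) and index[i][0] == key:
        return index[i][1]
    return None


rows = [("bob", 1), ("Alice", 2), ("alice", 3), ("Carol", 4)]

# Index built under the old collator, then searched under the new one.
stale = sorted(rows, key=lambda r: old_key(r[0]))
missed = lookup(stale, "alice", new_key)    # None: the row exists but is not found

# "Compaction": same rows, re-sorted under the new collator.
rebuilt = sorted(stale, key=lambda r: new_key(r[0]))
found = lookup(rebuilt, "alice", new_key)   # 3
```

The rebuilt list holds exactly the rows of the stale one, which is why re-sorting (rather than re-running the map functions) is enough in this toy model.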

When we detect a libicu version discrepancy, we'd submit the view for
compaction. We even have a dedicated "upgrade" [1] channel in smoosh
which handles file version format upgrades; we'd tweak that logic
to trigger on libicu version mismatches as well.
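The decision itself could look roughly like this Python sketch. Only the "upgrade" channel name comes from the text above; the function names, arguments, and version strings are hypothetical. Treating a view with no recorded version as a mismatch is an assumption of the sketch.

```python
def needs_collation_upgrade(view_versions, current_version):
    """True when the view was built, even partly, with another collator.

    An empty list (no version recorded) also counts as a mismatch;
    that choice is an assumption of this sketch.
    """
    return set(view_versions) != {current_version}


def pick_channel(view_versions, current_version):
    """Return the compaction channel to enqueue the view on, or None."""
    if needs_collation_upgrade(view_versions, current_version):
        return "upgrade"
    return None
```

For example, pick_channel(["153.88"], "153.104") selects the upgrade channel, while a view already built entirely with the current collator is left alone.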

Would this work? Does anyone see any issue with that approach?

[1] https://github.com/apache/couchdb/blob/3.x/src/smoosh/src/smoosh_server.erl#L435-L442

Cheers,
-Nick



On Fri, Oct 29, 2021 at 7:01 PM Nick Vatamaniuc <va...@apache.org> wrote:
>
> Hello everyone,
>
> CouchDB by default uses the libicu library to sort its view rows.
> When views are built, we do not record or track the version of the
> collation algorithm. The issue is that the ICU library may modify the
> collation order between major libicu versions, and when that happens,
> views built with the older versions may experience data loss. I wanted
> to discuss the option to record the libicu collator version in each
> view then warn the user when there is a mismatch. Also, optionally
> ignore the mismatch, or automatically rebuild the views.
>
> Imagine, for example, searching patient records using start/end keys.
> It could be possible that, say, the first letter of their name now
> collates differently in a new libicu. That would prevent the patient
> record from showing up in the view results for some important
> procedure or medication. Users might not even be aware of this kind of
> data loss occurring, since there won't be any error in the API or
> warning in the logs.
>
> I was thinking about how to solve this. There were a few commits already
> to clean up our collation drivers [1], expose libicu and collation
> algorithm version in the new _versions endpoint [2], and some other
> minor fixes in that area. As the next steps we could:
>
>   1) Modify our views to keep track of the collation algorithm
> version. We could attempt to transparently upgrade the view header
> format -- read the old view file, update the header with an extra
> libicu collation version field, that updates the signature, and then,
> save the file with the new header and new signature. This avoids view
> rebuilds, just records the collator version in the view and moves the
> files to a new name.
>
>   2) Do what PostgreSQL does, and 2a) emit a warning with the view
> results when the current libicu version doesn't match the version in
> the view [3]. That means altering the view results to add a "warning":
> "..." field. Another alternative 2b) is emit a warning in the
> _design/$ddoc/_info only. Users would have to know that after an OS
> version upgrade, or restoring backups, to make sure to look at their
> _design/$ddoc/_info for each db for each ddoc. Of course, there may be
> users which used the "raw" collation option, or know they are using
> just the plain ASCII character sets in their views. So we'd have a
> configuration setting to ignore the warnings as well.
>
>   3) Users who see the warning, could then either rebuild the view
> with the new collator library manually, or it could happen
> automatically based on a configuration option, basically "when
> collator versions are miss-matched, invalidate and rebuild all the
> views".
>
>   4) We'd have a way for the users to assert (POST a ddoc update) that
> they double-checked the new ICU version and are convinced that a
> particular view would not experience data loss with the new collator.
> That should make the warning go away, and the view to not be rebuilt.
> This can't be just a naive "collator" option setting as both per-view
> and per-design options are used when computing the view signature, and
> any changes there would result in the view being rebuilt. Perhaps we
> can add it to the design docs as a separate option which is excluded
> from the signature hash, like the "autoupdate" setting for background
> index builder ("collation_version_accept"?). PostgreSQL also offers
> this option with the ALTER COLLATION ... REFRESH VERSION command [3]
>
> What do we think, is this a reasonable approach? Is there something
> easier / simpler we can do?
>
> Thanks!
> -Nick
>
> [1] https://github.com/apache/couchdb/pull/3746/commits/28f26f52fe2e170d98658311dafa8198d96b8061
> [2] https://github.com/apache/couchdb/commit/c1bb4e4856edd93255d75ebe158b4da38bbf3333
> [3] https://www.postgresql.org/docs/13/sql-altercollation.html