Posted to dev@lucene.apache.org by Cassandra Targett <ca...@gmail.com> on 2019/03/08 17:34:15 UTC

Re: [DISCUSS] Opening old indices for reading

I have a question about Simon’s commit that he discussed in an earlier mail to this thread, found at https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752

I see the commit diffs and files changed in GitHub at the above URL, but one odd thing about it is that it doesn’t refer to any branch and a scan of the code doesn’t show these changes at all. I looked for branches and PRs and didn’t find anything that jumped out at me. There also weren’t any notifications to the commits@lucene.a.o list about these changes.

So, were the changes really made? Was it just intended as some code for discussion, or was it meant to be in master branch? If the former, how does one make a commit without a branch? If the change was intended to be in master, though, it seems something has gone awry and we should try to fix it.

Cassandra
On Jan 31, 2019, 8:23 AM -0600, Adrien Grand <jp...@gmail.com>, wrote:
> This looks reasonable to me.
>
> On Tue, Jan 29, 2019 at 4:23 PM Simon Willnauer
> <si...@gmail.com> wrote:
> >
> > thanks folks,
> >
> > these are all good points. I created a first cut of what I had in mind
> > [1]. It's relatively simple, and from a Java visibility perspective
> > the only changes that a user can take advantage of are this [2] and this
> > [3]. This would allow opening indices back to Lucene 7.0,
> > given that the codecs and postings formats are available. From a
> > documentation perspective I added [4]. This is a pure read-only change
> > and doesn't allow opening these indices for writing. You can't merge
> > them, nor would you be able to open an index writer on top of them. I
> > still need to add support to CheckIndex, but that's basically what it
> > is.
> >
> > lemme know what you think,
> >
> > simon
> > [1] https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752
> > [2] https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-e0352098b027d6f41a17c068ad8d7ef0R689
> > [3] https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-e3ccf9ee90355b10f2dd22ce2da6c73cR306
> > [4] https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-1bedf4d0d52ff88ef8a16a6788ad7684R86
> >
> > On Fri, Jan 25, 2019 at 3:14 PM Michael McCandless
> > <lu...@mikemccandless.com> wrote:
> > >
> > > Another example: long ago, Lucene allowed pos=-1 to be indexed, and it caused all sorts of problems. We also stopped allowing positions close to Integer.MAX_VALUE (https://issues.apache.org/jira/browse/LUCENE-6382). Yet another is allowing negative vInts, which are possible but horribly inefficient (https://issues.apache.org/jira/browse/LUCENE-3738).
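To illustrate why negative vInts are inefficient, here is a minimal sketch of a variable-length int encoder (7 payload bits per byte, high bit marking continuation). The class and method names are hypothetical and this is not Lucene's actual code; it only assumes the usual vInt layout. Because a negative int always has its top bit set, it can never fit in fewer than 5 bytes under this scheme.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class, not part of Lucene.
class VIntSketch {
    // Encodes an int using 7 payload bits per byte; the high bit of each
    // byte is set when more bytes follow.
    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits plus continuation flag
            value >>>= 7;                     // unsigned shift, so negatives terminate
        }
        out.write(value);                     // final byte, continuation flag clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println("5   -> " + encodeVInt(5).length + " byte(s)");
        System.out.println("300 -> " + encodeVInt(300).length + " byte(s)");
        System.out.println("-1  -> " + encodeVInt(-1).length + " byte(s)");
    }
}
```

Small non-negative values take one or two bytes here, while every negative value takes the maximum of five, which is more than a fixed-width 4-byte int.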
> > >
> > > We do need to be free to fix these problems and then know after N+2 releases that no index can have the issue.
> > >
> > > I like the idea of providing an "expert" / best-effort / limited way of carrying forward such ancient indices, but I think the huge challenge for someone using that tool on an important index will be enumerating the list of issues that might "matter" (the 3 Adrien listed + the 3 I listed above is a start for this list) and taking appropriate steps to "correct" the index if so. E.g. on a norms encoding change, somehow these expert tools must decode norms the old way, encode them the new way, and then rewrite the norms files. Or if the index has pos=-1, changing that to pos=0. Or if it has negative vInts, ... etc.
> > >
> > > Or maybe the "special" DirectoryReader only reads stored fields? And so you would enumerate your _source and reindex into the latest format ...
> > >
> > > > Something like https://issues.apache.org/jira/browse/LUCENE-8277 would
> > > > help make it harder to introduce corrupt data in an index.
> > >
> > > +1
> > >
> > > Every time we catch something like "don't allow pos = -1 into the index" we need to somehow remember to also go and add the check in addIndexes.
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > >
> > > On Fri, Jan 25, 2019 at 3:52 AM Adrien Grand <jp...@gmail.com> wrote:
> > > >
> > > > Agreed with Michael that setting expectations is going to be
> > > > important. The thing that I would like to make sure is that we would
> > > > never refrain from moving Lucene forward because of this feature. In
> > > > particular, lucene-core should be free to make assumptions that are
> > > > valid for N and N-1 indices without worrying about the fact that we
> > > > have this super-expert feature that allows opening older indices. Here
> > > > are some assumptions that I have in mind which have not always been
> > > > true:
> > > > - norms might be encoded in a different way (this changed in 7)
> > > > - all index files have a checksum (only true since Lucene 5)
> > > > - offsets are always going forward (only enforced since Lucene 7)
> > > >
> > > > This means that carrying indices over by just merging them with the
> > > > new version to move them to a new codec won't work all the time. For
> > > > instance if your index has backward offsets and new codecs assume that
> > > > offsets are going forward, then merging might fail or corrupt offsets
> > > > - I'd like to make sure that we would not consider this a bug.
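As a rough sketch of what "offsets are always going forward" could mean in practice, the check below rejects a token stream whose start offsets move backwards or whose end offset precedes its start offset. The class and method names are hypothetical, and the exact rule is an assumption made for illustration rather than a copy of Lucene's enforcement code.

```java
// Hypothetical illustration class, not part of Lucene.
class OffsetCheckSketch {
    // Returns true when every token satisfies 0 <= start <= end and
    // start offsets never decrease from one token to the next.
    static boolean offsetsGoForward(int[] startOffsets, int[] endOffsets) {
        if (startOffsets.length != endOffsets.length) {
            throw new IllegalArgumentException("offset arrays must have equal length");
        }
        int lastStart = 0;
        for (int i = 0; i < startOffsets.length; i++) {
            int start = startOffsets[i];
            int end = endOffsets[i];
            if (start < 0 || end < start || start < lastStart) {
                return false; // backward or malformed offset
            }
            lastStart = start;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(offsetsGoForward(new int[] {0, 4, 9}, new int[] {3, 8, 12}));
        System.out.println(offsetsGoForward(new int[] {5, 2}, new int[] {7, 4}));
    }
}
```

An old index that fails a check like this is the kind of input that a newer codec, written under the forward-offsets assumption, might mishandle during a merge.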
> > > >
> > > > Erick, I don't think this feature would be suitable for "robust index
> > > > upgrades". To me it is really a best effort and shouldn't be trusted
> > > > too much.
> > > >
> > > > I think some users will be tempted to wrap old readers to make them
> > > > look good and then add them back to an index using addIndexes?
> > > > Something like https://issues.apache.org/jira/browse/LUCENE-8277 would
> > > > help make it harder to introduce corrupt data in an index.
> > > >
> > > > On Wed, Jan 23, 2019 at 3:11 PM Simon Willnauer
> > > > <si...@gmail.com> wrote:
> > > > >
> > > > > Hey folks,
> > > > >
> > > > > tl;dr; I want to be able to open an indexreader on an old index if the
> > > > > SegmentInfo version is supported and all segment codecs are available.
> > > > > Today that's not possible even if I port old formats to current
> > > > > versions.
> > > > >
> > > > > Our BWC policy for quite a while has been N-1 major versions. That's
> > > > > > good and I think we should keep it that way. Only recently, because of
> > > > > > changes in how we encode/decode norms, we also hard-enforce the
> > > > > > index-version-created in several places, as well as the version a
> > > > > > segment was written with. These are great enforcements and I understand
> > > > > > why. My request here is whether we can find consensus on somehow
> > > > > > allowing (via a special DirectoryReader, for instance) such an index to
> > > > > > be opened for reading only, without the guarantee that our high-level
> > > > > > APIs decode norms correctly, for instance. This would be enough to, for
> > > > > > instance, consume stored fields etc. for reindexing, or, if users are
> > > > > > aware, to do the norms decoding in the codec themselves. I am happy to
> > > > > > work on a proposal for how this would work. It would still enforce no
> > > > > > writing or anything like this. I am also all for putting such a reader
> > > > > > into misc and marking it experimental.
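The distinction between the regular N-1 policy and the proposed read-only path can be sketched as two separate checks. Everything below is hypothetical (the class, the method names, and the `minReadableMajor` and `codecsAvailable` parameters are assumptions made for illustration); it only mirrors the conditions described above and is not how Lucene implements them.

```java
// Hypothetical illustration class, not part of Lucene.
class IndexVersionSketch {
    // Regular policy: the index must have been created by the current
    // major version (N) or the previous one (N-1).
    static boolean supportedByDefault(int indexCreatedMajor, int currentMajor) {
        return indexCreatedMajor <= currentMajor
            && indexCreatedMajor >= currentMajor - 1;
    }

    // Proposed expert path: read-only access to older indices, provided the
    // segment version is still understood and all segment codecs are present.
    static boolean readableByExpertReader(int indexCreatedMajor, int currentMajor,
                                          int minReadableMajor, boolean codecsAvailable) {
        return codecsAvailable
            && indexCreatedMajor >= minReadableMajor
            && indexCreatedMajor <= currentMajor;
    }

    public static void main(String[] args) {
        // An index two major versions old: rejected by default, but readable
        // through the expert path when its codecs are on the classpath.
        System.out.println(supportedByDefault(7, 9));
        System.out.println(readableByExpertReader(7, 9, 7, true));
    }
}
```

Keeping the two checks separate reflects the intent that the relaxed path never loosens what the default reader and the writer accept.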
> > > > >
> > > > > simon
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: dev-help@lucene.apache.org
> > > > >
> > > >
> > > >
> > > > --
> > > > Adrien
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: dev-help@lucene.apache.org
> > > >
> >
>
>
> --
> Adrien
>