Posted to java-user@lucene.apache.org by Paul Allan Hill <pa...@metajure.com> on 2011/11/16 22:55:24 UTC

Upgrade Path Lucene 3.0.2 to 3.4

As it says in the title, we are moving from 3.0.2 to 3.4.  I am interested in whether we need to build a new index or can just keep changing the current one.   My company has been busy building software and has not upgraded the Lucene and Tika libraries since last year, but I'm trying to remedy that as quickly as I can.   We have production indices with 1,000,000 to 5,000,000 English-language documents.  These are business documents (the usual MS Word, PDF ...) with only the very occasional phrase in another character set (for example, a Japanese or Chinese company name inserted in an otherwise English document).

So here are my high-level questions when making such an upgrade jump:

1.       Do we need to start from scratch and create a new index, or can I re-crawl documents into the existing index?
My impression is that, if we were coming from 2.x, the answer would definitely be that a rebuild is required, but the answer doesn't jump out at me for the releases since then. I think the answer is that a rebuild is not needed.
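If a rebuild is not needed, the re-crawl can simply replace each document in place. A minimal sketch of that pattern (field names like "id" and "body" are my own illustration, and a RAMDirectory stands in for the real on-disk index):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class RecrawlSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory(); // stand-in for the existing on-disk index
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34)));

        // First crawl: each document is keyed by a unique id field.
        Document doc = new Document();
        doc.add(new Field("id", "doc-1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body", "old extracted text", Field.Store.NO, Field.Index.ANALYZED));
        writer.updateDocument(new Term("id", "doc-1"), doc);

        // Re-crawl: updateDocument atomically deletes the old copy and adds the
        // new one, so the index never holds two copies of the same document.
        Document redone = new Document();
        redone.add(new Field("id", "doc-1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        redone.add(new Field("body", "improved Tika extraction", Field.Store.NO, Field.Index.ANALYZED));
        writer.updateDocument(new Term("id", "doc-1"), redone);
        writer.close();

        IndexReader reader = IndexReader.open(dir);
        System.out.println("numDocs=" + reader.numDocs()); // still one logical document
        reader.close();
    }
}
```

Newly written segments come out in the current file format, so over time the re-crawl itself migrates most of the index forward.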

2.       If we don't HAVE TO RE-CREATE the index, are there advantages to doing so anyway?

a.       Should I be looking into eventually leveraging org.apache.lucene.index.IndexUpgrader (see LUCENE-3082<http://issues.apache.org/jira/browse/LUCENE-3082>)?
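As I understand LUCENE-3082, IndexUpgrader (added in 3.2) rewrites all older segments into the current format without re-indexing. A sketch of how I believe it is invoked (here against a small throwaway index in a RAMDirectory; in real use you would open an FSDirectory on the existing index path):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class UpgradeSketch {
    public static void main(String[] args) throws Exception {
        // Build a tiny index so there is something to upgrade.
        Directory dir = new RAMDirectory(); // stand-in for FSDirectory.open(indexPath)
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34)));
        Document doc = new Document();
        doc.add(new Field("id", "doc-1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Rewrites every segment into the current index format, in place.
        new IndexUpgrader(dir).upgrade();

        IndexReader reader = IndexReader.open(dir);
        System.out.println("numDocs=" + reader.numDocs()); // contents are preserved
        reader.close();
        dir.close();
    }
}
```

I believe it can also be run from the command line (java org.apache.lucene.index.IndexUpgrader, with the index directory as an argument), but check the JavaDoc for the exact options.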

In our application there is a single Lucene "service" running in this system, and it will be running the latest code, so there are no issues of old code needing to access the index.

Because of the improvements in Tika over the last year, we will set our system to re-crawl all documents, so I believe this eliminates the various issues involving tokenization fixes.
We have tests which demonstrate that the new Lucene libraries, when used to index and then search, return the same (or improved) results.  We also have tests verifying that Tika has greatly improved its ability to parse (three cheers to the Tika folks for parsing half of the previously failing PDFs and 40% of the old MS Word-95 docs).  Hats off to the folks involved in both - great job on both bug fixes and the new features!

But my question is about (1) updating the libraries while (2) using an existing index in which all documents will (eventually) be replaced. Given my scenario, what are my issues, if any?  I attempt to answer my own question below, and I think the answer is that I don't need to create a new clean index.
I would be interested in any feedback.

-Paul
p.s. If I had one suggestion, it would be that the release-note summary of a bug is better form without shorthand acronyms (or with a link to an appropriate description or even the JavaDoc).  Obviously the bug discussion will contain all kinds of terse usage, but the one-liners in release notes are read by folks a little less informed about some of the parts of Lucene.

*********** Detailed Review Follows *******

Reviewing the releases at http://lucene.apache.org/java/docs/index.html

The Java 7 JVM optimization bug has been fixed.  This is great; we were aware of this, so never used Java 7.

The Unicode changes across JVM versions, referenced in the Java 7 and other JVM upgrade notes, are interesting.
See for example the copy at:
https://github.com/apache/lucene-solr/blob/trunk/lucene/JRE_VERSION_MIGRATION.txt

In my case, we will be running the code under Java 7 while re-indexing, so I think all will be properly upgraded.

Reviewing the 3.4 bugs, there only seem to be a few that relate to the files of the index on disk:
LUCENE-3409<http://issues.apache.org/jira/browse/LUCENE-3409>: IndexWriter.deleteAll was [....], leading to unused files accumulating in the Directory.
My Comment: Curiously, the details for this bug describe a memory leak, not a problem with files on disk, but in any case we aren't using near-real-time readers (yet) and only use deleteAll in test indexes.

LUCENE-3358<http://issues.apache.org/jira/browse/LUCENE-3358>, LUCENE-3361<http://issues.apache.org/jira/browse/LUCENE-3361>: StandardTokenizer and UAX29URLEmailTokenizer wrongly [...in ...] Han or Hiragana characters...
My Comment: This (if it is even relevant to us) would be fixed by re-indexing, which we will be doing anyway.
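For the record, the tokenizer behavior is controlled by the Version constant passed to the analyzer, so re-indexing with the new constant is what actually picks up such fixes. A small sketch of inspecting what the 3.4 StandardAnalyzer produces (the field name "body" and sample text are just illustration):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenizeSketch {
    public static void main(String[] args) throws Exception {
        // Version.LUCENE_34 opts in to the current tokenizer behavior;
        // already-indexed documents keep their old tokens until re-indexed.
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
        TokenStream ts = analyzer.tokenStream("body", new StringReader("Hello World"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        List<String> tokens = new ArrayList<String>();
        ts.reset();
        while (ts.incrementToken()) {
            tokens.add(term.toString()); // collect each emitted token text
        }
        ts.end();
        ts.close();
        System.out.println(tokens); // lowercased tokens: [hello, world]
    }
}
```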

LUCENE-3368<http://issues.apache.org/jira/browse/LUCENE-3368> IndexWriter applies wrong deletes during concurrent flush-all
My Comment: This only occurs when there are two writers, which we don't have.  I thought only one writer was allowed, so I'm really not grokking this bug. Can anyone explain this one to me?

LUCENE-3365<http://issues.apache.org/jira/browse/LUCENE-3365>: ... can cause IndexWriter overriding an existing index.
My Comment: I think we would have known about this one if it did occur in our system, but it is now fixed.

LUCENE-3418<http://issues.apache.org/jira/browse/LUCENE-3418>: Lucene was failing to fsync index files on commit, meaning an operating system or hardware crash, or power loss, could easily corrupt the index.
My Comment:  This is the issue mentioned in the release announcement.  Luckily for us, even though we've had production environments crash during a power outage, we didn't see this.
Reading the notes on this, it seems this was a hard fail that was obvious when it occurred.

Reviewing the 3.3 release:
There appear to be no bugs which affected the files on disk that are not fixed by re-indexing.
Reviewing the 3.2.0 release:
LUCENE-3065<http://issues.apache.org/jira/browse/LUCENE-3065>:  In the API changes it says that Document.getField() was deprecated. In the changes in runtime behavior it says "... Document.getFieldable() returns NumericField instances".
My Comment:  We have more than one numeric field in our index and have already moved to using Document.getFieldable(), so we're doing this the right way.
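My understanding of the 3.2 change is that a stored numeric field retrieved from the index now comes back from getFieldable() as a NumericField, so the typed value is recoverable. A sketch (field name "pageCount" and the value are my own illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NumericFieldSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34)));
        Document doc = new Document();
        // Stored and indexed numeric field.
        doc.add(new NumericField("pageCount", Field.Store.YES, true).setIntValue(42));
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = IndexReader.open(dir);
        Document stored = reader.document(0);
        // Since 3.2, the retrieved field is a NumericField, not a plain text Field.
        Fieldable f = stored.getFieldable("pageCount");
        System.out.println("isNumeric=" + (f instanceof NumericField));
        System.out.println("value=" + ((NumericField) f).getNumericValue().intValue());
        reader.close();
    }
}
```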

Reviewing 3.1.0 release:
There appear to be no bugs which affected the files on disk that are not fixed by re-indexing documents (for example LUCENE-2911<http://issues.apache.org/jira/browse/LUCENE-2911>).
Reviewing 3.0.3 release:
There appear to be no bugs which affected the files on disk that are not fixed by re-indexing documents.

That doesn't seem bad at all!
Comments?