Posted to java-user@lucene.apache.org by "Rob Staveley (Tom)" <rs...@seseit.com> on 2009/12/11 18:20:56 UTC

Lucene 3.0.0 writer with a Lucene 2.3.1 index

I'm upgrading from 2.3.1 to 3.0.0. I have 3.0.0 index readers ready to go
into production and writers in the process of upgrading to 3.0.0.

I think I understand the implications of
http://wiki.apache.org/lucene-java/BackwardsCompatibility#File_Formats for
the upgrade, but I'd love it if someone could validate my following
assumptions.

  1. My 2.3.1 indexes have compressed fields in them, which the 3.0.0
readers work nicely with, as expected. I should assume that my 3.0.0 readers
will continue to handle 2.3.1 indexes OK.

  2. Presumably all future Lucene 3.x index readers will continue to handle
compressed fields and we should only anticipate Lucene 4.x choking on them.

I was naively expecting my index directories to grow when my 3.0.0 index
writer merged the 2.3.1 indexes and/or optimize()'d them, converting them to
3.0.0. However, I don't see that. Presumably that means that:

  3. Documents added to existing 2.3.1 indexes will be added conforming to
3.0.0, but existing documents in the index will continue to have compressed
content and old documents can coexist happily with the new ones, and my
indexes will become a mixture of 2.3.1 and 3.0.0.

  4. I should use
http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/util/Version.html#LUCENE_23
for the StandardAnalyzer and QueryParser in mixed indexes in 3.0.0 if I want
to handle analysis consistently, or go for LUCENE_CURRENT if I want to
handle the new content "better" (bearing in mind that the new content will
eventually replace the old content anyhow).

  5. I should use
http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/analysis/StopFilter.html#StopFilter%28boolean,%20org.apache.lucene.analysis.TokenStream,%20java.util.Set%29
with enablePositionIncrements=false in mixed indexes in 3.0.0 if I want to
handle analysis consistently, or go for enablePositionIncrements=true if I
want to handle the new content "better" (bearing in mind that the new
content will eventually replace the old content anyhow).
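
To make the enablePositionIncrements trade-off in point 5 concrete, here is
a minimal plain-Java model of stop-word removal. It is only a sketch: it
does not use Lucene, the names PositionIncrementSketch and positions are
made up for illustration, and it assumes simple whitespace tokenization. It
shows how the positions recorded for the surviving terms differ between the
two settings.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model, not Lucene code: mimics how a stop filter can either
// keep or close the position gaps left by removed stop words.
final class PositionIncrementSketch {

    // Maps each surviving term to its position in the token stream.
    // Assumes every term occurs once, which is enough for this illustration.
    static Map<String, Integer> positions(String text, Set<String> stopWords,
                                          boolean enablePositionIncrements) {
        Map<String, Integer> result = new LinkedHashMap<String, Integer>();
        int position = -1; // position of the last emitted token
        int skipped = 0;   // stop words dropped since the last emitted token
        for (String token : text.trim().toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (stopWords.contains(token)) {
                skipped++;
                continue;
            }
            // With increments enabled, the gap left by stop words is kept.
            int increment = enablePositionIncrements ? 1 + skipped : 1;
            position += increment;
            skipped = 0;
            result.put(token, position);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<String>(Arrays.asList("of", "the"));
        String text = "fox of the north";
        System.out.println("increments on:  " + positions(text, stop, true));
        System.out.println("increments off: " + positions(text, stop, false));
    }
}
```

In this model, "north" keeps its original position 3 when increments are
enabled and moves up to position 1 when they are disabled, so an exact
phrase match on "fox north" behaves differently depending on which setting
produced the positions. That is the consistency concern for an index that
mixes old and new documents.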


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Lucene 3.0.0 writer with a Lucene 2.3.1 index

Posted by "Rob Staveley (Tom)" <rs...@seseit.com>.

Thanks for picking up on this, Anshum and Uwe.

I used the following approach to convert my 2.3 index (which, yes, was
optimised already) to 3.0:

   Using Lucene 3.0, I created a new empty index with my IndexWriter. I
opened my 2.3 index with an IndexReader. I added the 2.3 index with
writer.addIndexes(reader) and then optimized and committed. I assume that
counts as 2 segments being optimized, despite the fact that my new segment
would have been empty. Of my 3 indexes, I noticed a small growth in the one
which has no compressed fields and a very small shrinkage in the two
which did have compressed fields.
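
Comparing sizes before and after a conversion like this is easier with a
small helper. The following is a minimal sketch using only java.nio.file;
the class name IndexSizeSketch and the two directory arguments are
placeholders, and it simply sums the sizes of the regular files found under
each index directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper for comparing on-disk index sizes; not part of Lucene.
final class IndexSizeSketch {

    // Sums the sizes in bytes of all regular files under the given directory.
    static long directorySize(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: IndexSizeSketch <old-index-dir> <new-index-dir>");
            return;
        }
        long before = directorySize(Paths.get(args[0]));
        long after = directorySize(Paths.get(args[1]));
        System.out.printf("before=%d bytes, after=%d bytes, change=%+d bytes%n",
                before, after, after - before);
    }
}
```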

So, it looks like it wasn't a no-op, but it also looks like I was
compressing fields smaller than 1K, as Uwe suspected. Typically these were
synopsis fields with 3-sentence extracts from the texts being indexed. I
hadn't realised that the threshold for compression to pay dividends was as
high as 1K. I would have been better off not compressing those fields.
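
One way to avoid that mistake when compressing values yourself is to guard
compression with a size check. This is a sketch under assumptions: the
1024-char cut-off is taken from Uwe's rule of thumb, the names
ThresholdCompressionSketch, StoredValue, encode and decode are invented
here, and java.util.zip stands in for whatever compression the application
uses. It keeps the deflated bytes only when they are actually smaller than
the raw ones.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical policy sketch; not Lucene's own implementation.
final class ThresholdCompressionSketch {

    static final int MIN_CHARS_TO_COMPRESS = 1024; // assumed cut-off

    // Holds the stored bytes plus a flag saying whether they are deflated.
    static final class StoredValue {
        final boolean compressed;
        final byte[] bytes;

        StoredValue(boolean compressed, byte[] bytes) {
            this.compressed = compressed;
            this.bytes = bytes;
        }
    }

    static StoredValue encode(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        if (text.length() < MIN_CHARS_TO_COMPRESS) {
            return new StoredValue(false, raw); // too short to be worth it
        }
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        // The output buffer is no larger than the input, so deflation only
        // finishes in one call when the result is not bigger than the input.
        byte[] out = new byte[raw.length];
        int n = deflater.deflate(out);
        boolean smaller = deflater.finished() && n < raw.length;
        deflater.end();
        return smaller ? new StoredValue(true, Arrays.copyOf(out, n))
                       : new StoredValue(false, raw);
    }

    static String decode(StoredValue value) {
        if (!value.compressed) {
            return new String(value.bytes, StandardCharsets.UTF_8);
        }
        Inflater inflater = new Inflater();
        inflater.setInput(value.bytes);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed value", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A short synopsis passed to encode comes back flagged as uncompressed, while
a long, repetitive text comes back deflated; decode handles both cases.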

It looks like I'll benefit from Lucene 3 stopping me from abusing
compression! 8-)

Many thanks!


RE: Lucene 3.0.0 writer with a Lucene 2.3.1 index

Posted by Uwe Schindler <uw...@thetaphi.de>.

The index *should* grow after merging/optimizing, but it will only do so
if the compressed fields were not bigger than they would have been without
compression. One of the tests showed that a string field with 80 ASCII
chars needed about 250 bytes when compressed, which is 3 times the
uncompressed size (as chars are UTF-8 encoded). So it was always a bad idea
to compress short fields; compressing fields of, say, fewer than 1024 chars
is simply a waste of time and disk space.

So maybe you hit this issue: some fields were so small that the compressed
representation was larger than the uncompressed one, and others the other
way round. This leads to no net change.
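
The fixed overhead is easy to see outside Lucene. The sketch below uses
java.util.zip.Deflater directly to compare raw UTF-8 size with deflated
size. It is an assumption that this approximates what compressed stored
fields cost, and the class name, sample strings, and BEST_COMPRESSION level
are chosen for illustration only. Exact byte counts depend on the input, so
none are claimed here.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical measurement sketch; it does not call any Lucene code.
final class DeflateOverheadSketch {

    // Returns the number of bytes produced by deflating the UTF-8 form of text.
    static int deflatedSize(String text) {
        byte[] input = text.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[1024];
        int total = 0;
        // Drain the deflater, counting output bytes without keeping them.
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    static void report(String label, String text) {
        int raw = text.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(label + ": raw=" + raw + " bytes, deflated="
                + deflatedSize(text) + " bytes");
    }

    public static void main(String[] args) {
        StringBuilder longField = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            longField.append("Lucene stores this sentence again and again. ");
        }
        report("short", "A brief synopsis.");
        report("long", longField.toString());
    }
}
```

For very short inputs the stream header and checksum alone outweigh any
savings, whereas long, repetitive text shrinks considerably.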

By the way, if your index was already optimized in 2.3 and you try to
optimize it in 3.0, it will be a no-op, as optimization needs at least two
segments to merge.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

Re: Lucene 3.0.0 writer with a Lucene 2.3.1 index

Posted by Anshum <an...@gmail.com>.

Hi Tom,
Pt 3: As far as I know, it wouldn't be a 'mixture' of 2 index types.
Rather, as soon as you optimize (or do an IndexWriter operation on the
current index), it would expand the index to an uncompressed format. I read
somewhere in the release notes that growth in the index size should be
anticipated and handled when doing so.

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............

