Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2012/11/27 22:10:04 UTC

Extreme index size reduction on 4.1-SNAPSHOT?

With a 4.1 snapshot from a couple of weeks ago, I saw about a 5% drop in 
index size compared to 3.5.0 when using the same schema.  When I updated 
my 4.1 schema to use ICUTokenizer so I could use CJKBigramFilter, my 
index dropped further -- to about 10% less than 3.5, still using the 
same 4.1 snapshot.
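
For context, the analyzer change amounted to something like the 
following in schema.xml.  The field type name is just an example, and 
ICUTokenizerFactory comes from the analysis-extras contrib, so your 
exact setup may differ:

```xml
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ICUTokenizer segments CJK text so CJKBigramFilter can pair it -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```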

Yesterday I checked out the newest 4.1 snapshot and built the index 
again.  Comparing a recently optimized 3.5.0 index with the same data 
recently optimized under the new 4.1, I am seeing a drop of more than 
30 percent -- 15.49 GB instead of 22.7 GB.  As noted above, some of 
that drop can be explained by the schema change, but not THAT much.  
I am very impressed.
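
As a quick back-of-the-envelope check on those numbers:

```python
# Sanity check of the reported size drop (sizes in GB, from above).
old_gb = 22.7   # optimized 3.5.0 index
new_gb = 15.49  # same data under the 4.1 snapshot
drop_pct = (old_gb - new_gb) / old_gb * 100
print(f"{drop_pct:.1f}% smaller")  # -> 31.8% smaller
```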

Looking at the index directories from yesterday compared to what I 
remember about the directories a couple of weeks ago, it appears that 
some of the files that had Lucene40 in the filename now have Lucene41 in 
the filename.

Is there any chance that this is an indication of a problem, or is the 
expected index reduction really that good?

Thanks,
Shawn


Re: Extreme index size reduction on 4.1-SNAPSHOT?

Posted by Shawn Heisey <so...@elyograg.org>.
On 11/27/2012 2:25 PM, Markus Jelsma wrote:
> Hi, please check this issue:
> https://issues.apache.org/jira/browse/LUCENE-4226
>
> But it is enabled because of:
> https://issues.apache.org/jira/browse/LUCENE-4509
>
> Since it's suddenly the default, you have to completely wipe the index and reindex the data; at least I had to, because of numerous codec exceptions. It significantly reduced the very large indexes we have.

I noticed the exceptions when I tried to restart after updating the 
.war.  I stopped Solr, completely wiped out my data directories, and ran 
a DIH full-import on all shards after starting back up.  The almost 32 
percent drop in index size caught me off guard.

I had seen the compressed stored field issue come across the dev and 
commits lists, but I didn't connect the dots in my brain.

I would imagine that if Solr has to actually hit the disk, this will be 
faster, but if the data is already in the OS disk cache, it would be 
slower.  I'm curious whether the document cache stores the compressed or 
uncompressed version.  If it's the uncompressed version, the document 
cache would get rid of any penalty.
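
For reference, that cache is configured in solrconfig.xml; the sizes 
here are just example values:

```xml
<!-- solrconfig.xml: caches Document objects fetched for responses -->
<documentCache class="solr.LRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>
```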

Are there any config knobs for turning compression on/off, or changing 
the compression algorithm?  Are those knobs available to Solr?  I'm not 
doing anything on the scale of the Hathi Trust, but would I ever have 
any reasonable need to change things?

Thanks,
Shawn


RE: Extreme index size reduction on 4.1-SNAPSHOT?

Posted by Markus Jelsma <ma...@openindex.io>.
Hi, please check this issue:
https://issues.apache.org/jira/browse/LUCENE-4226

But it is enabled because of:
https://issues.apache.org/jira/browse/LUCENE-4509

Since it's suddenly the default, you have to completely wipe the index and reindex the data; at least I had to, because of numerous codec exceptions. It significantly reduced the very large indexes we have.
 
 
-----Original message-----
> From:Shawn Heisey <so...@elyograg.org>
> Sent: Tue 27-Nov-2012 22:16
> To: solr-user@lucene.apache.org
> Subject: Extreme index size reduction on 4.1-SNAPSHOT?
> 
> With a 4.1 snapshot from a couple of weeks ago, I saw about a 5% drop in 
> index size compared to 3.5.0 when using the same schema.  When I updated 
> my 4.1 schema to use ICUTokenizer so I could use CJKBigramFilter, my 
> index dropped further -- to about 10% less than 3.5, still using the 
> same 4.1 snapshot.
> 
> Yesterday I checked out the newest 4.1 snapshot and built the index 
> again.  Comparing a recently optimized 3.5.0 index with the same data 
> recently optimized under the new 4.1, I am seeing a drop of more than 
> 30 percent -- 15.49 GB instead of 22.7 GB.  As noted above, some of 
> that drop can be explained by the schema change, but not THAT much.  
> I am very impressed.
> 
> Looking at the index directories from yesterday compared to what I 
> remember about the directories a couple of weeks ago, it appears that 
> some of the files that had Lucene40 in the filename now have Lucene41 in 
> the filename.
> 
> Is there any chance that this is an indication of a problem, or is the 
> expected index reduction really that good?
> 
> Thanks,
> Shawn
> 
>