Posted to solr-user@lucene.apache.org by "engy.ali" <om...@hotmail.com> on 2009/08/25 21:30:45 UTC

Solr index - Size and indexing speed

 Summary
===============

I have about 120,000 objects with a total size of 71.2 GB. Those objects are
already indexed using Lucene, and the index size is about 111 GB.

I tried to use a Solr 1.4 nightly build to index the same collection. I
divided the collection across three servers, each running 5 Solr instances
(not Solr cores).

After the collection had been indexed, I merged the 15 indexes.

Problems
==============

1. The new merged index is about 411 GB (i.e., about 4 times the size of the
old index built with Lucene).

I tried indexing only one object using Lucene and the same object using Solr
to verify the size, and the Solr index was about twice the size of the Lucene
index.

Do you have any idea what might be the reason?


2. The indexing speed is slow. 100 objects on a single Solr instance were
indexed in 1 hour, so I estimated that 1000 objects on a single instance could
be done in 10 hours. That was not the case: the indexing time exceeded the
estimate by about 12 hours.

Might that be related to the growth of the index? If not, what might be the
reason?

Note: I commit every 100 objects and optimize at the end of the whole
operation. I also changed the mergeFactor from 10 to 15.


3. I googled and found out that Solr uses an inverted index, but I want to
know the internal structure of the Solr index. For example, if I have a word
and its stems, how will they be stored in the index?

Thanks, 
Engy
-- 
View this message in context: http://www.nabble.com/Solr-index---Size-and-indexing-speed-tp25140702p25140702.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr index - Size and indexing speed

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Sat, Aug 29, 2009 at 7:09 AM, engy.ali<om...@hotmail.com> wrote:
> I thought that optimization would decrease the index size, or at least
> leave it the same as before optimization

Some index structures, like norms, are non-sparse.  Index one unique
field with norms and there is a byte allocated for every document in
the index.  Merge that with another index, and the size of the norms for
that field goes to byte[maxDoc()] of the merged index.
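
To put rough numbers on this, here is a minimal sketch in Python. It assumes
one byte of norms per document for every field that has norms, as described
above. The layout (15 sub-indexes of 8,000 documents each, every sub-index
with one field unique to it) is hypothetical and chosen only to match the
scale of this thread, assuming one document per object.

```python
def norms_size_separate(indexes):
    """Total norms bytes while the indexes are still separate.

    `indexes` is a list of (fields_with_norms, max_doc) pairs, where
    `fields_with_norms` is a set of field names.
    """
    return sum(len(fields) * max_doc for fields, max_doc in indexes)


def norms_size_merged(indexes):
    """Norms bytes after merging: every field with norms gets one byte
    for every document in the merged index, whether or not the document
    actually has that field."""
    all_fields = set().union(*(fields for fields, _ in indexes))
    total_docs = sum(max_doc for _, max_doc in indexes)
    return len(all_fields) * total_docs


# Hypothetical layout: 15 indexes, 8,000 docs each, one unique field per index.
indexes = [({f"field_{i}"}, 8000) for i in range(15)]
separate = norms_size_separate(indexes)
merged = norms_size_merged(indexes)
print(separate, merged, merged // separate)
```

With these made-up numbers the absolute sizes are small; the point is only
that non-sparse structures grow with (fields with norms) x (total documents)
after a merge, so the merged index can be larger than the sum of its parts.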

-Yonik
http://www.lucidimagination.com

RE: Solr index - Size and indexing speed

Posted by Fuad Efendi <fu...@efendi.ca>.
>I tried to merge the 15 indexes again, and I found that the new merged
>index (without optimization) was about 351 GB, but when I optimized it
>the size went back to 411 GB. Why?


As an analogy, think of an index-organized table (IOT) in Oracle...


OK, in plain language, what does 'optimization' mean? It means that the map is
physically sorted by key. For Lucene, the 'map' is 'term -> documentIDs'.

So far, no problem. But what if the key is compressed (or, for
instance, 'normalized', if you are still thinking in RDBMS terms) and we need
to decompress it in order to merge the 15 maps?
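
As a toy illustration of merging 'term -> documentIDs' maps, here is a
minimal Python sketch. It is not Lucene's on-disk format or its merge code;
the function name merge_postings and the sample data are hypothetical. Each
small index numbers its documents from 0, so the merge shifts document IDs by
the number of documents in the preceding indexes and then emits the terms in
sorted order.

```python
def merge_postings(indexes):
    """Merge several small inverted indexes into one.

    `indexes` is a list of (postings, doc_count) pairs, where `postings`
    maps a term to a sorted list of local document IDs.
    Returns (merged_postings, total_doc_count).
    """
    merged = {}
    offset = 0
    for postings, doc_count in indexes:
        for term, doc_ids in postings.items():
            # Shift local IDs so they stay unique in the merged index.
            merged.setdefault(term, []).extend(d + offset for d in doc_ids)
        offset += doc_count
    # Emit terms in sorted order, like a map physically sorted by key.
    return dict(sorted(merged.items())), offset


first = ({"solr": [0, 1], "index": [1]}, 2)
second = ({"lucene": [0], "index": [0, 2]}, 3)
merged, total_docs = merge_postings([first, second])
print(merged, total_docs)
```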

-Fuad




RE: Solr index - Size and indexing speed

Posted by "engy.ali" <om...@hotmail.com>.
Hi, 

Thanks for your reply.

I will work on your suggestion to use only one Solr instance.

I tried to merge the 15 indexes again, and I found that the new merged
index (without optimization) was about 351 GB, but when I optimized it
the size went back to 411 GB. Why?

I thought that optimization would decrease the index size, or at least
leave it the same as before optimization.






RE: Solr index - Size and indexing speed

Posted by Fuad Efendi <fu...@efendi.ca>.
Hi,

Can you try using a single SOLR instance with a lot of RAM (so that
ramBufferSizeMB=8192, for instance) and mergeFactor=10? A single SOLR instance
is fast enough (> 100 Tomcat client threads; configurable). I usually
prefer a single instance on a single "writable" box with a large RAM
allocation and good I/O.

A merged index 4 times larger than the original could happen, for instance,
because of differences between the SOLR schema and your Lucene setup; make
sure the schema is the same (using Luke, for instance). SOLR 1.4 has some
powerful new features, such as a document->term cache stored somewhere
(uninverted index) (Yonik), term vectors, stored=true, copyField, etc.

Do not commit every 100 objects; commit once at the end...
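
A minimal Python sketch of "send documents in batches, commit once at the
end", using Solr's XML update messages (<add>, <commit/>). The URL, the batch
size and the field names are assumptions for illustration, not values from
this thread; adjust them to your own setup.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr

# Assumed location of the update handler; change it for your installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_xml(docs):
    """Build one <add> message from a list of {field_name: value} dicts."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append(
                "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
            )
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def update_messages(docs, batch_size=100):
    """Yield one <add> message per batch, then a single <commit/> at the end."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield build_add_xml(batch)
            batch = []
    if batch:
        yield build_add_xml(batch)
    yield "<commit/>"


def post(xml, url=SOLR_UPDATE_URL):
    """POST one XML update message to Solr and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_all(docs, send=post):
    """Send every batch, committing only once after the last one."""
    for message in update_messages(docs):
        send(message)
```

Calling index_all(my_docs) against a running instance sends the batches and
then exactly one commit; passing a different `send` function makes it easy to
inspect the messages without touching a server.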







Re: Solr index - Size and indexing speed

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, Aug 25, 2009 at 3:30 PM, engy.ali<om...@hotmail.com> wrote:
> 1. The new merged index is about 411 GB (i.e., about 4 times the size of the
> old index built with Lucene).
>
> I tried indexing only one object using Lucene and the same object using Solr
> to verify the size, and the Solr index was about twice the size of the Lucene
> index.
>
> Do you have any idea what might be the reason?

Check out the schema you are using - it may contain copyFields, etc.
You should be able to get to exactly the same size of index as you had
with Lucene (Solr just uses Lucene for indexing after all).
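
One quick way to check a schema for the usual size suspects is a small script
like the Python sketch below. The sample schema is made up for illustration,
and the script only looks at attributes written explicitly on each <field>
element; it does not resolve defaults inherited from the field type, so treat
its output as a starting point rather than a complete answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema.xml content, for illustration only.
SAMPLE_SCHEMA = """<schema name="example" version="1.1">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="body" type="text" indexed="true" stored="true" termVectors="true"/>
    <field name="title" type="text" indexed="true" stored="false" omitNorms="true"/>
    <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
  </fields>
  <copyField source="body" dest="text"/>
  <copyField source="title" dest="text"/>
</schema>"""


def size_suspects(schema_xml):
    """List schema settings that commonly make an index larger."""
    root = ET.fromstring(schema_xml)
    fields = list(root.iter("field"))
    return {
        # Each copyField indexes the source content a second time.
        "copy_fields": [(c.get("source"), c.get("dest")) for c in root.iter("copyField")],
        # Stored fields keep the original value in the index.
        "stored": [f.get("name") for f in fields if f.get("stored") == "true"],
        # Term vectors add per-document term data.
        "term_vectors": [f.get("name") for f in fields if f.get("termVectors") == "true"],
    }


report = size_suspects(SAMPLE_SCHEMA)
for key, value in report.items():
    print(key, value)
```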

-Yonik
http://www.lucidimagination.com