Posted to solr-user@lucene.apache.org by Claudio Devecchi <cd...@gmail.com> on 2010/11/12 16:49:31 UTC

Doubt about index size

Hi everybody,

I'm doing some indexing tests on Solr 1.4.1 and there is one thing I don't
understand. Let me try to explain.

I have 1.2 million XML files and I'm indexing them. When I index them for the
first time, my index size is around 3 GB, and the statistics at
http://localhost:8983/solr/admin/stats.jsp show these two entries:

numDocs : 1120171
maxDoc : 1120171

So far so good, but if I update the index by reindexing the same 1120171
documents, I get the stats below:

numDocs : 1120171
maxDoc : 2240342

... and my index size grows to around 6 GB.

Why does this happen? Why does the index size change if I have the same
number of searchable docs?

Does anybody know?

Thanks

Re: Doubt about index size

Posted by Erick Erickson <er...@gmail.com>.
It's probably a good idea to optimize. How are you re-indexing anyway? DIH?
custom code? post.jar?

Optimizing manually is just a matter of issuing the appropriate curl command; see:
http://wiki.apache.org/solr/UpdateXmlMessages#A.22commit.22_and_.22optimize.22
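
For anyone scripting this in Python rather than curl, here is a minimal sketch of the same request: an `<optimize/>` XML message posted to the update handler. The host, port, and `/solr/update` path are assumptions based on the stats URL in the original question, and `build_optimize_request` and `send_optimize` are illustrative names, not part of Solr.

```python
import urllib.request

# Assumed location of the update handler; host and port follow the stats
# URL quoted in the question and may differ on your install.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_optimize_request(update_url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST carrying an <optimize/> message."""
    return urllib.request.Request(
        update_url,
        data=b"<optimize/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def send_optimize(update_url=SOLR_UPDATE_URL, timeout=3600):
    """Send the optimize message and return the HTTP status code.

    An optimize rewrites the index, so the timeout is deliberately long.
    """
    request = build_optimize_request(update_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_optimize()` against a running Solr instance should be roughly equivalent to the curl command described on the wiki page above.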

Best
Erick

On Fri, Nov 12, 2010 at 12:13 PM, Claudio Devecchi <cd...@gmail.com> wrote:

> Hi Tom, thanks for your explanation,
>
> Do you recommend leaving the index this way? Or can I configure it to
> optimize automatically?
>
> tks

RE: Doubt about index size

Posted by "Burton-West, Tom" <tb...@umich.edu>.
An optimize takes a lot of CPU and I/O since it has to rewrite your index, so only do it when necessary.

You can just use curl to send an optimize message to Solr when you are ready.

See:
http://wiki.apache.org/solr/UpdateXmlMessages#Passing_commit_parameters_as_part_of_the_URL
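
A rough Python equivalent of the URL-parameter form, in case curl is not handy. This is only a sketch: the base URL is an assumption taken from the stats URL in the original question, and `optimize_url` and `run_optimize` are hypothetical helper names.

```python
import urllib.parse
import urllib.request

# Assumed Solr base URL; adjust host, port, and path for your install.
DEFAULT_BASE_URL = "http://localhost:8983/solr"


def optimize_url(base_url=DEFAULT_BASE_URL):
    """Return the update URL with optimize=true as a query parameter."""
    query = urllib.parse.urlencode({"optimize": "true"})
    return f"{base_url.rstrip('/')}/update?{query}"


def run_optimize(base_url=DEFAULT_BASE_URL, timeout=3600):
    """Request the optimize and return the HTTP status code.

    The long timeout reflects that an optimize rewrites the index.
    """
    with urllib.request.urlopen(optimize_url(base_url), timeout=timeout) as resp:
        return resp.status
```

Since an optimize is expensive, something like `run_optimize()` is probably best called once at the end of a full reindex rather than after every batch.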

Tom

Re: Doubt about index size

Posted by Claudio Devecchi <cd...@gmail.com>.
Hi Tom, thanks for your explanation,

Do you recommend leaving the index this way? Or can I configure it to
optimize automatically?

tks


-- 
Claudio Devecchi
flickr.com/cdevecchi

RE: Doubt about index size

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Hi Claudio,

What's happening when you re-index the documents is that Solr/Lucene implements an update as a delete plus a new add. Because of the nature of inverted indexes, physically removing a document requires rewriting the index. To avoid rewriting the index each time a document is deleted, deletes are implemented as a list of deleted internal Lucene ids. Documents aren't actually removed from the index until the segment containing them is merged or an optimize occurs.

maxDoc is the total number of documents in the index, including those marked as deleted.
numDocs is the number of undeleted documents.
In your case maxDoc - numDocs = 2240342 - 1120171 = 1120171, which is the old copy of every document, marked as deleted but still on disk.

If you run an optimize, the index will be rewritten, the index size will go down, and numDocs will equal maxDoc.
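
To make the bookkeeping concrete, here is a toy Python model of the behaviour described above. It is purely illustrative: `ToyIndex` is a made-up class, not Lucene's on-disk format or API, and it assumes each document carries a unique key so that re-adding a key counts as an update.

```python
class ToyIndex:
    """Toy model of delete-marking; not how Lucene stores data on disk."""

    def __init__(self):
        self.slots = []       # one entry per internal doc id: the unique key
        self.deleted = set()  # internal ids marked as deleted
        self.live = {}        # unique key -> internal id of the live copy

    def add(self, key):
        # An update is a delete of the old copy plus an add of a new one.
        old = self.live.get(key)
        if old is not None:
            self.deleted.add(old)
        self.slots.append(key)
        self.live[key] = len(self.slots) - 1

    @property
    def max_doc(self):
        # Counts every slot, including those marked as deleted.
        return len(self.slots)

    @property
    def num_docs(self):
        # Counts only the undeleted documents.
        return len(self.slots) - len(self.deleted)

    def optimize(self):
        # Rewrite the index, dropping every slot marked as deleted.
        kept = [k for i, k in enumerate(self.slots) if i not in self.deleted]
        self.slots = kept
        self.deleted = set()
        self.live = {k: i for i, k in enumerate(kept)}


index = ToyIndex()
for _ in range(2):            # index the same 5 keys twice
    for key in range(5):
        index.add(key)
print(index.num_docs, index.max_doc)  # 5 10
index.optimize()
print(index.num_docs, index.max_doc)  # 5 5
```

Indexing the same keys twice doubles `max_doc` while `num_docs` stays put, which mirrors the 1120171 / 2240342 stats in the question; the optimize step brings the two back together.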

Tom Burton-West
