
Posted to java-user@lucene.apache.org by Garrett Heaver <ga...@researchandmarkets.com> on 2004/12/06 11:03:24 UTC

addIndexes() Size

Hi.

It's probably really simple to explain this, but since I'm not up to speed on
the way Lucene stores the data I'm a little confused.

I'm building an index, which resides on Server A, with the Lucene service
running on Server B. Not to bore you with the details, but because of the
network transfer rate etc. I'm keeping the actual index on \\ServerA\idx
and building a temp index at \\ServerB\idx\temp (obviously because the local
FS is much faster for the service) and then calling addIndexes to import the
temp index into the ServerA index before destroying the ServerB index,
holding for a bit and then checking for new documents.

All works grand BUT the size of the resultant index on ServerA is HUGE in
comparison to one I'd build from start to finish (i.e. a simple addDocument
index) - 38 GB for 220,000 Unstored items cannot be right (to give you an
idea of how mad this seems, the backed-up version of the database from which
the data is pulled is only 2 GB).

I've considered that it might be down to the number of items that had to be
integrated each time addIndexes was called - right now I'm adding around
10,000 at a time (I had done 1,000 at a time but this looked like it was
going to end up even larger still).
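
A quick sanity check on those figures, as a minimal Java sketch. The class
and method names are hypothetical (nothing here is Lucene API), and it
assumes that a "gig" means 2^30 bytes and that each batch corresponds to one
addIndexes call:

```java
// Hypothetical helper for back-of-the-envelope numbers; not part of Lucene.
final class IndexSizeEstimate {
    // Assumption: "gig" in the message means 2^30 bytes.
    static final long GIB = 1024L * 1024L * 1024L;

    /** Average number of bytes per document (integer division, rounds down). */
    static long bytesPerDocument(long totalBytes, long docCount) {
        if (docCount <= 0) {
            throw new IllegalArgumentException("docCount must be positive");
        }
        return totalBytes / docCount;
    }

    /** Number of batches needed to import docCount documents (rounds up). */
    static long mergeCalls(long docCount, long batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        return (docCount + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        long docs = 220_000L;
        System.out.println("index bytes/doc:    " + bytesPerDocument(38 * GIB, docs));
        System.out.println("database bytes/doc: " + bytesPerDocument(2 * GIB, docs));
        System.out.println("calls at 10,000:    " + mergeCalls(docs, 10_000L));
        System.out.println("calls at 1,000:     " + mergeCalls(docs, 1_000L));
    }
}
```

Under those assumptions the index averages roughly 181 KB per document
against about 9.5 KB per record in the database backup, around 19 times
larger, and importing 220,000 documents takes 22 calls at 10,000 per batch or
220 calls at 1,000 per batch.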

 

I'm holding off twiddling minMergeDocs and mergeFactor until I can get a
better understanding of what's going on here.

Many thanks for any replies

Garrett

RE: addIndexes() Size

Posted by Garrett Heaver <ga...@researchandmarkets.com>.

OK, I upgraded to 1.4.3 but that didn't solve the issue - I was still ending
up with huge indexes. So I changed approach: instead of handing the
addIndexes method IndexReaders, I gave it the directory, as the code for that
variant is slightly different - and now the index size is what I would expect
it to be. I haven't had time to check fully why, but from what I can see the
major difference between the two methods is that addIndexes(IndexReader[])
uses the following:

    if (segmentInfos.size() == 1) // add existing index, if any
        merger.add(new SegmentReader(segmentInfos.info(0)));

Perhaps this is resulting in an unnecessary ballooning of the index?

I'll leave it for someone with a better understanding of the underlying file
system to answer...

Thanks
Garrett

-----Original Message-----
From: Garrett Heaver [mailto:garrett.heaver@researchandmarkets.com] 
Sent: 06 December 2004 17:32
To: 'Lucene Users List'
Subject: RE: addIndexes() Size

Cheers for that Erik - believe it or not I'm still back at v1.3 (doh!!!)

Will try 1.4.3 tomorrow

Thanks
Garrett

-----Original Message-----
From: Erik Hatcher [mailto:erik@ehatchersolutions.com] 
Sent: 06 December 2004 17:27
To: Lucene Users List
Subject: Re: addIndexes() Size

There was a bug in 1.4 (and maybe 1.4.1?) that kept some index files 
around that were not used.

Are you using Lucene 1.4.3?  If not, try that and see if it helps.

	Erik

On Dec 6, 2004, at 12:17 PM, Garrett Heaver wrote:

> No, there are no duplicate copies - I've the correct number when I view
> through Luke and I don't overlap - the temporary index is destroyed after it
> is added to the main index - I'm currently at index version 159 and it seems
> that all of my .prx files come in at around 1435 megs (ouch)
>
> Thanks
> Garrett
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: 06 December 2004 17:12
> To: Lucene Users List
> Subject: Re: addIndexes() Size
>
> If I were you, I would first use Luke to peek at the index.  You may
> find something obvious there, like multiple copies of the same
> Document.
> Does your temp index 'overlap' with the A index in terms of Documents?  If
> so, you will end up with multiple copies, as the addIndexes method doesn't
> detect and remove duplicate Documents.
>
> Otis
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org

RE: addIndexes() Size

Posted by Garrett Heaver <ga...@researchandmarkets.com>.

Cheers for that Erik - believe it or not I'm still back at v1.3 (doh!!!)

Will try 1.4.3 tomorrow

Thanks
Garrett

Re: addIndexes() Size

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

There was a bug in 1.4 (and maybe 1.4.1?) that kept some index files 
around that were not used.

Are you using Lucene 1.4.3?  If not, try that and see if it helps.

	Erik

RE: addIndexes() Size

Posted by Garrett Heaver <ga...@researchandmarkets.com>.

No, there are no duplicate copies - I've the correct number when I view
through Luke and I don't overlap - the temporary index is destroyed after it
is added to the main index - I'm currently at index version 159 and it seems
that all of my .prx files come in at around 1435 megs (ouch)
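
One way to see where the space is going is to total the files in the index
directory by extension, so the .prx (term positions) total can be compared
with everything else. A minimal Java sketch using only the standard library;
the class name IndexFileSizes and its method are hypothetical and not part of
Lucene, and it only looks at files directly inside the given directory:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical diagnostic helper; not part of Lucene.
final class IndexFileSizes {
    /** Sums the sizes of regular files in indexDir, grouped by file extension. */
    static Map<String, Long> bytesByExtension(Path indexDir) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (Stream<Path> files = Files.list(indexDir)) {
            for (Path file : (Iterable<Path>) files::iterator) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip subdirectories and other non-files
                }
                String name = file.getFileName().toString();
                int dot = name.lastIndexOf('.');
                // Files without a dot (e.g. "segments") are grouped together.
                String ext = dot >= 0 ? name.substring(dot) : "(none)";
                totals.merge(ext, Files.size(file), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // The index directory is taken from the first argument, else the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        bytesByExtension(dir).forEach(
                (ext, bytes) -> System.out.println(ext + "\t" + bytes));
    }
}
```

If the totals are dominated by files from segments that the current index no
longer references, that would be consistent with unused files being left
behind rather than with duplicate documents.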

Thanks
Garrett

Re: addIndexes() Size

Posted by Otis Gospodnetic <ot...@yahoo.com>.

If I were you, I would first use Luke to peek at the index.  You may
find something obvious there, like multiple copies of the same
Document.
Does your temp index 'overlap' with the A index in terms of Documents?  If
so, you will end up with multiple copies, as the addIndexes method doesn't
detect and remove duplicate Documents.

Otis
