Posted to java-user@lucene.apache.org by "Bennett, Tony" <Be...@con-way.com> on 2011/08/15 20:39:26 UTC

What kind of System Resources are required to index 625 million row table...???

We are examining the possibility of using Lucene to provide Text Search 
capabilities for a 625 million row DB2 table.

The table has 6 fields, all of which must be stored in the Lucene Index.  
The largest column is 229 characters, the others are 8, 12, 30, and 1....
...with an additional column that is an 8 byte integer (i.e. a 'C' long long).

We have written a test app on a development system (AIX 6.1),
and have successfully Indexed 625 million rows...
...which took about 22 hours.

When writing the "search" application... we find a simple version works, however,
if we add a Filter or a "sort" to it... we get an "out of memory" exception.

Before continuing our research, we'd like to find a way to determine 
what system resources are required to run this kind of application...???
In other words, how do we calculate the memory needs...???

Have others created a similar sized Index to run on a single "shared" server...???


Current Environment:

	Lucene Version:	3.2
	Java Version:	J2RE 6.0 IBM J9 2.4 AIX ppc64-64 build jvmap6460-20090215_29883
                        (i.e. 64 bit Java 6)
	OS:			AIX 6.1
	Platform:		PPC  (IBM P520)
	cores:		2
	Memory:		8 GB
	jvm memory:	-Xms4072m -Xmx4072m

Any guidance would be greatly appreciated.

-tony





RE: What kind of System Resources are required to index 625 million row table...???

Posted by "Bennett, Tony" <Be...@con-way.com>.
Thanks for the suggestion.

Yes, we are using "no_norms".

-----Original Message-----
From: Mark Harwood [mailto:markharw00d@yahoo.co.uk] 
Sent: Tuesday, August 16, 2011 10:12 AM
To: java-user@lucene.apache.org
Subject: Re: What kind of System Resources are required to index 625 million row table...???

Check that "norms" are disabled on your fields, because they'll cost you 1 byte x NumberOfDocs x numberOfFieldsWithNormsEnabled.



On 16 Aug 2011, at 15:11, Bennett, Tony wrote:

> Thank you for your response.
> 
> You are correct, we are sorting on timestamp.
> Timestamp has microsecond granularity, and we are
> storing it as "NumericField".
> 
> We are sorting on timestamp, so that we can give our
> users the most "current" matches, since we are limiting
> the number of responses to about 1000.  We are concerned
> that limiting the number of responses without sorting,
> may give the user the "oldest" matches, which is not 
> what they want.
> 
> Your suggestion about reducing the granularity of the 
> sort is interesting.  We must "retain" the granularity
> of the "original" timestamp for Index maintenance purposes,
> but we could add another field, with a granularity of 
> "date" instead of "date+time", which would be used for 
> sorting only. 
> 
> -tony
> 
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com] 
> Sent: Tuesday, August 16, 2011 5:54 AM
> To: java-user@lucene.apache.org
> Subject: Re: What kind of System Resources are required to index 625 million row table...???
> 
> About your OOM. Grant asked a question that's pretty important:
> how many unique terms are in the field(s) you sorted on? At a guess,
> you tried sorting on your timestamp and your timestamp has
> millisecond or less granularity, so there are 625M of them.
> 
> Memory requirements for sorting grow as the number of *unique*
> terms. So you might be able to reduce the sorting requirements
> dramatically if you can use a coarser time granularity.
> 
> And if you're storing your timestamp as a string type, that's
> even worse, there are 60 or so bytes of overhead for
> each string.... see NumericField....
> 
> And if you can't reduce the granularity of the timestamp, there
> are some interesting techniques for reducing the memory
> requirements of timestamps that you sort on that we can discuss....
> 
> Luke can answer these questions if you point it at your index,
> but it may take a while to examine your index, so be patient.
> 
> Best
> Erick
> 
> On Mon, Aug 15, 2011 at 5:55 PM, Bennett, Tony <Be...@con-way.com> wrote:
>> Thanks for the quick response.
>> 
>> As to your questions:
>> 
>>  Can you talk a bit more about what the search part of this is?
>>  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on
>>  performance, memory, etc.
>> 
>> We currently have an "exact match" search facility, which uses SQL.
>> We would like to add "text search" capabilities...
>> ...initially, having the ability to search the 229 character field for a given word, or phrase, instead of an exact match.
>> A future enhancement would be to add a synonym list.
>> As to "field choice", yes, it is possible that all fields would be involved in the "search"...
>> ...in the interest of full disclosure, the fields are:
>>   - corp  - corporation that owns the document
>>   - type  - document type
>>   - tmst  - creation timestamp
>>   - xmlid - xml namespace ID
>>   - tag   - meta data qualifier
>>   - data  - actual metadata  (example:  carton of red 3 ring binders )
>> 
>> 
>> 
>>  Was this single threaded or multi-threaded?  How big was the resulting index?
>> 
>> The search would be a threaded application.
>> 
>>  How big was the resulting index?
>> 
>> The index that was built was 70 GB in size.
>> 
>>  Have you tried increasing the heap size?
>> 
>> We have increased the heap up to 4 GB... on an 8 GB machine...
>> That's why we'd like a methodology for calculating memory requirements
>> to see if this application is even feasible.
>> 
>> Thanks,
>> -tony
>> 
>> 
>> -----Original Message-----
>> From: Grant Ingersoll [mailto:gsingers@apache.org]
>> Sent: Monday, August 15, 2011 2:33 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: What kind of System Resources are required to index 625 million row table...???
>> 
>> 
>> On Aug 15, 2011, at 2:39 PM, Bennett, Tony wrote:
>> 
>>> We are examining the possibility of using Lucene to provide Text Search
>>> capabilities for a 625 million row DB2 table.
>>> 
>>> The table has 6 fields, all of which must be stored in the Lucene Index.
>>> The largest column is 229 characters, the others are 8, 12, 30, and 1....
>>> ...with an additional column that is an 8 byte integer (i.e. a 'C' long long).
>> 
>> Can you talk a bit more about what the search part of this is?  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on performance, memory, etc.
>> 
>>> 
>>> We have written a test app on a development system (AIX 6.1),
>>> and have successfully Indexed 625 million rows...
>>> ...which took about 22 hours.
>> 
>> Was this single threaded or multi-threaded?  How big was the resulting index?
>> 
>> 
>>> 
>>> When writing the "search" application... we find a simple version works, however,
>>> if we add a Filter or a "sort" to it... we get an "out of memory" exception.
>>> 
>> 
>> How many terms do you have in your index and in the field you are sorting/filtering on?  Have you tried increasing the heap size?
>> 
>> 
>>> Before continuing our research, we'd like to find a way to determine
>>> what system resources are required to run this kind of application...???
>> 
>> I don't know that there is a straightforward answer here with the information you've presented.  It can depend on how you intend to search/sort/filter/facet, etc.  General rule of thumb is that when you get over 100M documents, you need to shard, but you also have pretty small documents so your mileage may vary.  I've seen indexes in your range on a single machine (for small docs) with low search volumes, but that isn't to say it will work for you without more insight into your documents, etc.
>> 
>>> In other words, how do we calculate the memory needs...???
>>> 
>>> Have others created a similar sized Index to run on a single "shared" server...???
>>> 
>> 
>> Off the cuff, I think you are pushing the capabilities of doing this on a single machine, especially the one you have spec'd out below.
>> 
>>> 
>>> Current Environment:
>>> 
>>>       Lucene Version: 3.2
>>>       Java Version:   J2RE 6.0 IBM J9 2.4 AIX ppc64-64 build jvmap6460-20090215_29883
>>>                        (i.e. 64 bit Java 6)
>>>       OS:                     AIX 6.1
>>>       Platform:               PPC  (IBM P520)
>>>       cores:          2
>>>       Memory:         8 GB
>>>       jvm memory:     -Xms4072m -Xmx4072m
>>> 
>>> Any guidance would be greatly appreciated.
>>> 
>>> -tony
>> 
>> --------------------------------------------
>> Grant Ingersoll
>> Lucid Imagination
>> http://www.lucidimagination.com
>> 


Re: What kind of System Resources are required to index 625 million row table...???

Posted by Mark Harwood <ma...@yahoo.co.uk>.
Check that "norms" are disabled on your fields, because they'll cost you 1 byte x NumberOfDocs x numberOfFieldsWithNormsEnabled.
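
With 625M documents that is roughly 625,000,000 x 1 byte x 6 fields, or
about 3.75 GB of heap if norms were left enabled on every field. A minimal
Lucene 3.x sketch of indexing with norms omitted ("corp" and "data" are
field names from elsewhere in the thread; the values and the IndexWriter
"writer" are placeholders):

    Document doc = new Document();
    // exact-match key field: not analyzed, norms omitted
    doc.add(new Field("corp", corpValue, Field.Store.YES,
            Field.Index.NOT_ANALYZED_NO_NORMS));
    // free-text field searched for words/phrases: analyzed, norms still omitted
    doc.add(new Field("data", dataValue, Field.Store.YES,
            Field.Index.ANALYZED_NO_NORMS));
    writer.addDocument(doc);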



RE: What kind of System Resources are required to index 625 million row table...???

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

The answer is simple: it makes no sense for numeric fields to index
coarser granularity in addition to the finer-granularity field, because
NumericField already indexes a lot of additional terms at coarser
granularity (if precisionStep = 1..63) to speed up range queries. This
would simply make your index larger and consume more memory instead of
helping. With Lucene trunk/4.0 this is different, because every field is
handled like a separate "index" with its own term dictionary. But for
Lucene 3.x all terms from all fields are in one big term index. Also,
norms and term positions should be disabled for numeric fields (which
NumericField does by default in Lucene; Solr is different here).

Of course it may make sense to *only* index the coarser granularity, so you
should only index the granularity you really need. So when you need
queries down to day range, use an integer (not long) NumericField and
index the timestamp as Date.getTime()/86400000L (or similar).
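
A small sketch of that, assuming the original creation timestamp is
available as a java.util.Date, with a Document "doc" being built and an
open IndexSearcher at query time; "tmst_day" is a hypothetical field name,
and precisionStep Integer.MAX_VALUE is used because the field is only
sorted on, never range-queried:

    // days since epoch, as in Date.getTime()/86400000L above
    int day = (int) (created.getTime() / 86400000L);
    doc.add(new NumericField("tmst_day", Integer.MAX_VALUE, Field.Store.NO, true)
            .setIntValue(day));

    // at search time, newest-first on the coarse field
    Sort byDay = new Sort(new SortField("tmst_day", SortField.INT, true));
    TopFieldDocs hits = searcher.search(query, null, 1000, byDay);

Note that an int[] FieldCache entry for 625M documents is still about
2.5 GB, so this halves the sort cache relative to a long but does not make
it small.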

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de




Re: What kind of System Resources are required to index 625 million row table...???

Posted by Erick Erickson <er...@gmail.com>.
Uwe:

Thanks, I guess my mind is still stuck on the really old versions of Solr!

Quick clarification, which part "won't work"? I'm assuming it's the splitting
up of the dates into year, month, and day. Or are you talking about
indexing the dates with coarser granularity? Or both?

Thanks again,
Erick



RE: What kind of System Resources are required to index 625 million row table...???

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi Erick,

This is only true if you have string fields. Once you have the long values
in FieldCache they will always use exactly the same space. Having more
fields will, in contrast, blow up your IndexReader, as it needs much more
RAM to hold an even larger term index (because you have an even larger
terms index with different fields).

The user said he is using NumericField, so the uniqueness is irrelevant
here; strings are never used. To make the term index smaller and reduce RAM
usage, the only suggestion I have is to use a precisionStep of
Integer.MAX_VALUE for all NumericFields that are solely used for sorting.
The additional terms are only needed for NumericRangeQueries.
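
A sketch of that suggestion, assuming the full-resolution timestamp is
available as a long; "tmst_sort" is a hypothetical sort-only companion to
the "tmst" field named elsewhere in the thread:

    // sort-only copy: precisionStep Integer.MAX_VALUE indexes a single term
    // per value, so it adds as little as possible to the term index
    doc.add(new NumericField("tmst_sort", Integer.MAX_VALUE, Field.Store.NO, true)
            .setLongValue(timestamp));

    // keep the default precisionStep (4) only where NumericRangeQuery is needed
    doc.add(new NumericField("tmst", Field.Store.YES, true)
            .setLongValue(timestamp));

As noted above, this shrinks the term index, not the FieldCache: sorting on
the long field still loads one long per document.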

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de




Re: What kind of System Resources are required to index 625 million row table...???

Posted by Erick Erickson <er...@gmail.com>.
Using a new field with coarser granularity will work fine;
this is a common thing to do for this kind of issue.

Lucene is trying to load 625M longs into memory, in
addition to any other stuff. Ouch!
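
As a rough back-of-the-envelope, assuming one long per document in the
FieldCache: 625,000,000 docs x 8 bytes is about 5 GB, which on its own is
already larger than the 4 GB heap, before counting the term index, filter
bitsets, or anything else.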

If you want to get really clever, you can index several
fields, say year, month, and day for each date. The total
number of unique values that need to be sorted then is
(the number of years in your corpus + 12 + 31). Very
few unique values in all. And you can extend this
to hours, minutes, seconds and milliseconds which adds
a piddling 2,084 unique terms. Of course all your
sorts have to be re-written to take all these fields
into account, but it's do-able.

Warning: This has some gotchas, but.... There is one other
thing you can try, that's sorting by INDEXORDER. This
would only work for you if you index the records in date
order in the first place, so the first document you indexed
was the oldest, the second the next-oldest, etc. This won't
work if you update existing documents since updates are
really delete/add and would mess this ordering up. But if the
docs don't change, this might do.
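
A sketch of the INDEXORDER variant, assuming documents really were added
oldest-first and never updated; the reversed-doc-order Sort for a
newest-first view is an assumption worth verifying at this scale:

    // doc-id order == date order here, so this returns the oldest matches first
    TopFieldDocs oldestFirst = searcher.search(query, null, 1000, Sort.INDEXORDER);

    // newest-first: reverse the doc-id comparator instead of sorting on a field
    Sort newestFirst = new Sort(new SortField(null, SortField.DOC, true));
    TopFieldDocs newest = searcher.search(query, null, 1000, newestFirst);

Neither variant needs a FieldCache entry for the timestamp at all.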

Best
Erick


On Tue, Aug 16, 2011 at 10:11 AM, Bennett, Tony
<Be...@con-way.com> wrote:
> Thank you for your response.
>
> You are correct, we are sorting on timestamp.
> Timestamp has microsecond granualarity, and we are
> storing it as "NumericField".
>
> We are sorting on timestamp, so that we can give our
> users the most "current" matches, since we are limiting
> the number of responses to about 1000.  We are concerned
> that limiting the number of responses without sorting,
> may give the user the "oldest" matches, which is not
> what they want.
>
> Your suggestion about reducing the granularity of the
> sort is interesting.  We must "retain" the granularity
> of the "original" timestamp for Index maintenance purposes,
> but we could add another field, with a granularity of
> "date" instead of "date+time", which would be used for
> sorting only.
>
> -tony
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Tuesday, August 16, 2011 5:54 AM
> To: java-user@lucene.apache.org
> Subject: Re: What kind of System Resources are required to index 625 million row table...???
>
> About your OOM. Grant asked a question that's pretty important,
> how many unique terms in the field(s) you sorted on? At a guess,
> you tried sorting on your timestamp and your timestamp has
> millisecond or less granularity, so there are 625M of them.
>
> Memory requirements for sorting grow as the number of *unique*
> terms. So you might be able to reduce the sorting requirements
> dramatically if you can use a coarser time granularity.
>
> And if you're storing your timestamp as a string type, that's
> even worse, there are 60 or so bytes of overhead for
> each string.... see NumericField....
>
> And if you can't reduce the granularity of the timestamp, there
> are some interesting techniques for reducing the memory
> requirements of timestamps that you sort on that we can discuss....
>
> Luke can answer these questions if you point it at your index,
> but it may take a while to examine your index, so be patient.
>
> Best
> Erick
>
> On Mon, Aug 15, 2011 at 5:55 PM, Bennett, Tony <Be...@con-way.com> wrote:
>> Thanks for the quick response.
>>
>> As to your questions:
>>
>>  Can you talk a bit more about what the search part of this is?
>>  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on
>>  performance, memory, etc.
>>
>> We currently have an "exact match" search facility, which uses SQL.
>> We would like to add "text search" capabilities...
>> ...initially, having the ability to search the 229 character field for a given word, or phrase, instead of an exact match.
>> A future enhancement would be to add a synonym list.
>> As to "field choice", yes, it is possible that all fields would be involved in the "search"...
>> ...in the interest of full disclosure, the fields are:
>>   - corp  - corporation that owns the document
>>   - type  - document type
>>   - tmst  - creation timestamp
>>   - xmlid - xml namespace ID
>>   - tag   - meta data qualifier
>>   - data  - actual metadata  (example:  carton of red 3 ring binders )
>>
>>
>>
>>  Was this single threaded or multi-threaded?  How big was the resulting index?
>>
>> The search would be a threaded application.
>>
>>  How big was the resulting index?
>>
>> The index that was built was 70 GB in size.
>>
>>  Have you tried increasing the heap size?
>>
>> We have increased the heap up to 4 GB... on an 8 GB machine...
>> That's why we'd like a methodology for calculating memory requirements
>> to see if this application is even feasible.
>>
>> Thanks,
>> -tony
>>
>>
>> -----Original Message-----
>> From: Grant Ingersoll [mailto:gsingers@apache.org]
>> Sent: Monday, August 15, 2011 2:33 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: What kind of System Resources are required to index 625 million row table...???
>>
>>
>> On Aug 15, 2011, at 2:39 PM, Bennett, Tony wrote:
>>
>>> We are examining the possibility of using Lucene to provide Text Search
>>> capabilities for a 625 million row DB2 table.
>>>
>>> The table has 6 fields, all which must be stored in the Lucene Index.
>>> The largest column is 229 characters, the others are 8, 12, 30, and 1....
>>> ...with an additional column that is an 8 byte integer (i.e. a 'C' long long).
>>
>> Can you talk a bit more about what the search part of this is?  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on performance, memory, etc.
>>
>>>
>>> We have written a test app on a development system (AIX 6.1),
>>> and have successfully Indexed 625 million rows...
>>> ...which took about 22 hours.
>>
>> Was this single threaded or multi-threaded?  How big was the resulting index?
>>
>>
>>>
>>> When writing the "search" application... we find a simple version works, however,
>>> if we add a Filter or a "sort" to it... we get an "out of memory" exception.
>>>
>>
>> How many terms do you have in your index and in the field you are sorting/filtering on?  Have you tried increasing the heap size?
>>
>>
>>> Before continuing our research, we'd like to find a way to determine
>>> what system resources are required to run this kind of application...???
>>
>> I don't know that there is a straight forward answer here with the information you've presented.  It can depend on how you intend to search/sort/filter/facet, etc.  General rule of thumb is that when you get over 100M documents, you need to shard, but you also have pretty small documents so your mileage may vary.   I've seen indexes in your range on a single machine (for small docs) with low search volumes, but that isn't to say it will work for you without more insight into your documents, etc.
>>
>>> In other words, how do we calculate the memory needs...???
>>>
>>> Have others created a similar sized Index to run on a single "shared" server...???
>>>
>>
>> Off the cuff, I think you are pushing the capabilities of doing this on a single machine, especially the one you have spec'd out below.
>>
>>>
>>> Current Environment:
>>>
>>>       Lucene Version: 3.2
>>>       Java Version:   J2RE 6.0 IBM J9 2.4 AIX ppc64-64 build jvmap6460-20090215_29883
>>>                        (i.e. 64 bit Java 6)
>>>       OS:                     AIX 6.1
>>>       Platform:               PPC  (IBM P520)
>>>       cores:          2
>>>       Memory:         8 GB
>>>       jvm memory:     -Xms4072m -Xmx4072m
>>>
>>> Any guidance would be greatly appreciated.
>>>
>>> -tony
>>
>> --------------------------------------------
>> Grant Ingersoll
>> Lucid Imagination
>> http://www.lucidimagination.com
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: What kind of System Resources are required to index 625 million row table...???

Posted by "Bennett, Tony" <Be...@con-way.com>.
Thank you for your response.

You are correct, we are sorting on timestamp.
Timestamp has microsecond granularity, and we are
storing it as "NumericField".

We are sorting on timestamp, so that we can give our
users the most "current" matches, since we are limiting
the number of responses to about 1000.  We are concerned
that limiting the number of responses without sorting
may give the user the "oldest" matches, which is not
what they want.

Your suggestion about reducing the granularity of the 
sort is interesting.  We must "retain" the granularity
of the "original" timestamp for Index maintenance purposes,
but we could add another field, with a granularity of 
"date" instead of "date+time", which would be used for 
sorting only. 

-tony

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Tuesday, August 16, 2011 5:54 AM
To: java-user@lucene.apache.org
Subject: Re: What kind of System Resources are required to index 625 million row table...???

About your OOM. Grant asked a question that's pretty important,
how many unique terms in the field(s) you sorted on? At a guess,
you tried sorting on your timestamp and your timestamp has
millisecond or less granularity, so there are 625M of them.

Memory requirements for sorting grow as the number of *unique*
terms. So you might be able to reduce the sorting requirements
dramatically if you can use a coarser time granularity.

And if you're storing your timestamp as a string type, that's
even worse, there are 60 or so bytes of overhead for
each string.... see NumericField....

And if you can't reduce the granularity of the timestamp, there
are some interesting techniques for reducing the memory
requirements of timestamps that you sort on that we can discuss....

Luke can answer these questions if you point it at your index,
but it may take a while to examine your index, so be patient.

Best
Erick

On Mon, Aug 15, 2011 at 5:55 PM, Bennett, Tony <Be...@con-way.com> wrote:
> Thanks for the quick response.
>
> As to your questions:
>
>  Can you talk a bit more about what the search part of this is?
>  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on
>  performance, memory, etc.
>
> We currently have an "exact match" search facility, which uses SQL.
> We would like to add "text search" capabilities...
> ...initially, having the ability to search the 229 character field for a given word, or phrase, instead of an exact match.
> A future enhancement would be to add a synonym list.
> As to "field choice", yes, it is possible that all fields would be involved in the "search"...
> ...in the interest of full disclosure, the fields are:
>   - corp  - corporation that owns the document
>   - type  - document type
>   - tmst  - creation timestamp
>   - xmlid - xml namespace ID
>   - tag   - meta data qualifier
>   - data  - actual metadata  (example:  carton of red 3 ring binders )
>
>
>
>  Was this single threaded or multi-threaded?  How big was the resulting index?
>
> The search would be a threaded application.
>
>  How big was the resulting index?
>
> The index that was built was 70 GB in size.
>
>  Have you tried increasing the heap size?
>
> We have increased the heap up to 4 GB... on an 8 GB machine...
> That's why we'd like a methodology for calculating memory requirements
> to see if this application is even feasible.
>
> Thanks,
> -tony
>
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Monday, August 15, 2011 2:33 PM
> To: java-user@lucene.apache.org
> Subject: Re: What kind of System Resources are required to index 625 million row table...???
>
>
> On Aug 15, 2011, at 2:39 PM, Bennett, Tony wrote:
>
>> We are examining the possibility of using Lucene to provide Text Search
>> capabilities for a 625 million row DB2 table.
>>
>> The table has 6 fields, all which must be stored in the Lucene Index.
>> The largest column is 229 characters, the others are 8, 12, 30, and 1....
>> ...with an additional column that is an 8 byte integer (i.e. a 'C' long long).
>
> Can you talk a bit more about what the search part of this is?  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on performance, memory, etc.
>
>>
>> We have written a test app on a development system (AIX 6.1),
>> and have successfully Indexed 625 million rows...
>> ...which took about 22 hours.
>
> Was this single threaded or multi-threaded?  How big was the resulting index?
>
>
>>
>> When writing the "search" application... we find a simple version works, however,
>> if we add a Filter or a "sort" to it... we get an "out of memory" exception.
>>
>
> How many terms do you have in your index and in the field you are sorting/filtering on?  Have you tried increasing the heap size?
>
>
>> Before continuing our research, we'd like to find a way to determine
>> what system resources are required to run this kind of application...???
>
> I don't know that there is a straight forward answer here with the information you've presented.  It can depend on how you intend to search/sort/filter/facet, etc.  General rule of thumb is that when you get over 100M documents, you need to shard, but you also have pretty small documents so your mileage may vary.   I've seen indexes in your range on a single machine (for small docs) with low search volumes, but that isn't to say it will work for you without more insight into your documents, etc.
>
>> In other words, how do we calculate the memory needs...???
>>
>> Have others created a similar sized Index to run on a single "shared" server...???
>>
>
> Off the cuff, I think you are pushing the capabilities of doing this on a single machine, especially the one you have spec'd out below.
>
>>
>> Current Environment:
>>
>>       Lucene Version: 3.2
>>       Java Version:   J2RE 6.0 IBM J9 2.4 AIX ppc64-64 build jvmap6460-20090215_29883
>>                        (i.e. 64 bit Java 6)
>>       OS:                     AIX 6.1
>>       Platform:               PPC  (IBM P520)
>>       cores:          2
>>       Memory:         8 GB
>>       jvm memory:     -Xms4072m -Xmx4072m
>>
>> Any guidance would be greatly appreciated.
>>
>> -tony
>
> --------------------------------------------
> Grant Ingersoll
> Lucid Imagination
> http://www.lucidimagination.com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




Re: What kind of System Resources are required to index 625 million row table...???

Posted by Erick Erickson <er...@gmail.com>.
About your OOM. Grant asked a question that's pretty important,
how many unique terms in the field(s) you sorted on? At a guess,
you tried sorting on your timestamp and your timestamp has
millisecond or less granularity, so there are 625M of them.

Memory requirements for sorting grow as the number of *unique*
terms. So you might be able to reduce the sorting requirements
dramatically if you can use a coarser time granularity.

And if you're storing your timestamp as a string type, that's
even worse, there are 60 or so bytes of overhead for
each string.... see NumericField....
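
Put differently: with one distinct string value per row, 625,000,000 x
~60 bytes is on the order of 35 to 40 GB of cached sort strings, far
beyond any heap this machine could offer.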

And if you can't reduce the granularity of the timestamp, there
are some interesting techniques for reducing the memory
requirements of timestamps that you sort on that we can discuss....

Luke can answer these questions if you point it at your index,
but it may take a while to examine your index, so be patient.

Best
Erick

On Mon, Aug 15, 2011 at 5:55 PM, Bennett, Tony <Be...@con-way.com> wrote:
> Thanks for the quick response.
>
> As to your questions:
>
>  Can you talk a bit more about what the search part of this is?
>  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on
>  performance, memory, etc.
>
> We currently have an "exact match" search facility, which uses SQL.
> We would like to add "text search" capabilities...
> ...initially, having the ability to search the 229 character field for a given word, or phrase, instead of an exact match.
> A future enhancement would be to add a synonym list.
> As to "field choice", yes, it is possible that all fields would be involved in the "search"...
> ...in the interest of full disclosure, the fields are:
>   - corp  - corporation that owns the document
>   - type  - document type
>   - tmst  - creation timestamp
>   - xmlid - xml namespace ID
>   - tag   - meta data qualifier
>   - data  - actual metadata  (example:  carton of red 3 ring binders )
>
>
>
>  Was this single threaded or multi-threaded?  How big was the resulting index?
>
> The search would be a threaded application.
>
>  How big was the resulting index?
>
> The index that was built was 70 GB in size.
>
>  Have you tried increasing the heap size?
>
> We have increased the heap up to 4 GB... on an 8 GB machine...
> That's why we'd like a methodology for calculating memory requirements
> to see if this application is even feasible.
>
> Thanks,
> -tony
>
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Monday, August 15, 2011 2:33 PM
> To: java-user@lucene.apache.org
> Subject: Re: What kind of System Resources are required to index 625 million row table...???
>
>
> On Aug 15, 2011, at 2:39 PM, Bennett, Tony wrote:
>
>> We are examining the possibility of using Lucene to provide Text Search
>> capabilities for a 625 million row DB2 table.
>>
>> The table has 6 fields, all which must be stored in the Lucene Index.
>> The largest column is 229 characters, the others are 8, 12, 30, and 1....
>> ...with an additional column that is an 8 byte integer (i.e. a 'C' long long).
>
> Can you talk a bit more about what the search part of this is?  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on performance, memory, etc.
>
>>
>> We have written a test app on a development system (AIX 6.1),
>> and have successfully Indexed 625 million rows...
>> ...which took about 22 hours.
>
> Was this single threaded or multi-threaded?  How big was the resulting index?
>
>
>>
>> When writing the "search" application... we find a simple version works, however,
>> if we add a Filter or a "sort" to it... we get an "out of memory" exception.
>>
>
> How many terms do you have in your index and in the field you are sorting/filtering on?  Have you tried increasing the heap size?
>
>
>> Before continuing our research, we'd like to find a way to determine
>> what system resources are required to run this kind of application...???
>
> I don't know that there is a straight forward answer here with the information you've presented.  It can depend on how you intend to search/sort/filter/facet, etc.  General rule of thumb is that when you get over 100M documents, you need to shard, but you also have pretty small documents so your mileage may vary.   I've seen indexes in your range on a single machine (for small docs) with low search volumes, but that isn't to say it will work for you without more insight into your documents, etc.
>
>> In other words, how do we calculate the memory needs...???
>>
>> Have others created a similar sized Index to run on a single "shared" server...???
>>
>
> Off the cuff, I think you are pushing the capabilities of doing this on a single machine, especially the one you have spec'd out below.
>
>>
>> Current Environment:
>>
>>       Lucene Version: 3.2
>>       Java Version:   J2RE 6.0 IBM J9 2.4 AIX ppc64-64 build jvmap6460-20090215_29883
>>                        (i.e. 64 bit Java 6)
>>       OS:                     AIX 6.1
>>       Platform:               PPC  (IBM P520)
>>       cores:          2
>>       Memory:         8 GB
>>       jvm memory:     -Xms4072m -Xmx4072m
>>
>> Any guidance would be greatly appreciated.
>>
>> -tony
>
> --------------------------------------------
> Grant Ingersoll
> Lucid Imagination
> http://www.lucidimagination.com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: What kind of System Resources are required to index 625 million row table...???

Posted by Glen Newton <gl...@gmail.com>.
> We have increased the heap up to 4 GB... on an 8 GB machine...
> That's why we'd like a methodology for calculating memory requirements
> to see if this application is even feasible.

Please indicate whether you are speaking about the indexing part or the
searching part. There are times when it is not clear or is ambiguous.
:-)

The IBM Java VM has a limitation on the size of an NIO buffer. The
default is 64MB. This may be impacting your indexing and searching.
Consider setting this to a larger size
(-XX:MaxDirectMemorySize=<size>), perhaps to something similar to the RAMBuffer
size in your IndexWriter (assuming an NIOFSDirectory-based Directory). See
https://www.ibm.com/developerworks/java/jdk/aix/j664/sdkguide.aix64.html
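
For reference, a minimal Lucene 3.2 writer setup along those lines (the
index path and the 256 MB buffer are illustrative values only):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.NIOFSDirectory;
    import org.apache.lucene.util.Version;

    NIOFSDirectory dir = new NIOFSDirectory(new File("/path/to/index"));
    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_32,
            new StandardAnalyzer(Version.LUCENE_32));
    cfg.setRAMBufferSizeMB(256.0);    // flush buffered docs after ~256 MB
    IndexWriter writer = new IndexWriter(dir, cfg);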

With regard to the machine, you didn't indicate how much swap you were using.

Heap: unless there are other things running, you could try up to 7 GB of heap.

You should also consider using huge pages. PPC64 supports 4K (default)
and 16M pages (although this is more likely to speed things up than to
solve your heap problem...)
 General info for AIX and PPC:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/large_page_ovw.htm
Java vm command line:
"-Xlp<size>
    AIX: Requests the JVM to allocate the Java heap (the heap from
which Java objects are allocated) with large (16 MB) pages, if a size
is not specified. If large pages are not available, the Java heap is
allocated with the next smaller page size that is supported by the
system. AIX requires special configuration to enable large pages. For
more information about configuring AIX support for large pages, see
http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.prftungd/doc/prftungd/large_page_ovw.htm.
The SDK supports the use of large pages only to back the Java heap
shared memory segments. The JVM uses shmget() with the SHM_LGPG and
SHM_PIN flags to allocate large pages. The -Xlp option replaces the
environment variable IBM_JAVA_LARGE_PAGE_SIZE, which is now ignored if
set.
    AIX, Linux, and Windows only: If a <size> is specified, the JVM
attempts to allocate the JIT code cache memory using pages of that
size. If unsuccessful, or if executable pages of that size are not
supported, the JIT code cache memory is allocated using the smallest
available executable page size."

 General info on huge pages & Java, MySql, Linux, AIX:
http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html
 [my blog]


Consider some of the following Java VM command line options (some IBM
vm specific):
-   -Xgcpolicy:subpool    "Uses an improved object allocation
algorithm to achieve better performance when allocating objects on the
heap. This option might improve performance on large SMP systems"
-  -Xcompressedrefs   "Use -Xcompressedrefs in any of these
situations: When your Java applications does not need more than a 25
GB Java heap.     When your application uses a lot of native memory
and needs the JVM to run in a small footprint."
-  -Xcompactexplicitgc    "Enables full compaction each time
System.gc() is called."
-  -Xcompactgc   "Compacts on all garbage collections (system and global)."
-  -Xsoftrefthreshold<number> "Sets the value used by the GC to
determine the number of GCs after which a soft reference is cleared if
its referent has not been marked. The default is 32, meaning that the
soft reference is cleared after 32 * (percentage of free heap space)
GC cycles where its referent was not marked." Reducing this will clear
out soft references sooner. If any soft reference-based caching is
being used, cache hits will go down but memory will be freed up
faster. But this will not directly solve your OOM problem: "All soft
references are guaranteed to have been cleared before the
OutOfMemoryError is thrown.
    The default (no compaction option specified) makes the GC compact
based on a series of triggers that attempt to compact only when it is
beneficial to the future performance of the JVM." - from
https://www.ibm.com/developerworks/java/jdk/aix/j664/sdkguide.aix64.html

Very useful document on IBM Java VM: "Diagnostics Guide: IBM Developer
Kit and Runtime Environment, Java: Technology Edition, Version 6"
 http://download.boulder.ibm.com/ibmdl/pub/software/dw/jdk/diagnosis/diag60.pdf
  [page references refer to this document]
Relevant tips from this document on memory management:
- "Ensure that the heap never pages; that is, the maximum heap size
must be able to be contained in physical memory." p.8  Note that this
is a performance tip, not an OOM tip

You are using "-Xms4072m -Xmx4072m". The IBM documentation suggests
this is not a good choice:
"When you have established the maximum heap size that you need, you might
want to set the minimum heap size to the same value; for example, -Xms512M
-Xmx512M. However, using the same values is typically not a good idea,
because it
delays the start of garbage collection until the heap is full.
Therefore, the first time
that the GC runs, the process can take longer. Also, the heap is more
likely to be
fragmented and require a heap compaction. You are advised to start your
application with the minimum heap size that your application requires. When the
GC starts up, it will run frequently and efficiently, because the heap
is small." - p43

AIX allows different malloc policies to be used in the underlying
system calls. Consider using the WATSON (!) malloc policy. p.134,136
and http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/sys_mem_alloc.htm
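
Pulling several of those together, an illustrative invocation could look
like this (class and jar names are placeholders; check which options your
exact J9 build accepts before relying on any of them):

    export MALLOCTYPE=watson
    java -Xms1024m -Xmx7g -Xlp \
         -Xgcpolicy:subpool -Xcompressedrefs \
         -XX:MaxDirectMemorySize=512m \
         -cp lucene-core-3.2.0.jar:searchapp.jar com.example.SearchApp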

Finally (or before doing all of this! :-) ), do some profiling, both
inside of Java and of the AIX native heap using svmon (see "Native
Heap Exhaustion", p.135).

-Glen Newton
http://zzzoot.blogspot.com/




On Mon, Aug 15, 2011 at 5:55 PM, Bennett, Tony <Be...@con-way.com> wrote:
> Thanks for the quick response.
>
> As to your questions:
>
>  Can you talk a bit more about what the search part of this is?
>  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on
>  performance, memory, etc.
>
> We currently have an "exact match" search facility, which uses SQL.
> We would like to add "text search" capabilities...
> ...initially, having the ability to search the 229 character field for a given word, or phrase, instead of an exact match.
> A future enhancement would be to add a synonym list.
> As to "field choice", yes, it is possible that all fields would be involved in the "search"...
> ...in the interest of full disclosure, the fields are:
>   - corp  - corporation that owns the document
>   - type  - document type
>   - tmst  - creation timestamp
>   - xmlid - xml namespace ID
>   - tag   - meta data qualifier
>   - data  - actual metadata  (example:  carton of red 3 ring binders )
>
>
>
>  Was this single threaded or multi-threaded?  How big was the resulting index?
>
> The search would be a threaded application.
>
>  How big was the resulting index?
>
> The index that was built was 70 GB in size.
>
>  Have you tried increasing the heap size?
>
> We have increased the heap up to 4 GB... on an 8 GB machine...
> That's why we'd like a methodology for calculating memory requirements
> to see if this application is even feasible.
>
> Thanks,
> -tony
>
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Monday, August 15, 2011 2:33 PM
> To: java-user@lucene.apache.org
> Subject: Re: What kind of System Resources are required to index 625 million row table...???
>
>
> On Aug 15, 2011, at 2:39 PM, Bennett, Tony wrote:
>
>> We are examining the possibility of using Lucene to provide Text Search
>> capabilities for a 625 million row DB2 table.
>>
>> The table has 6 fields, all which must be stored in the Lucene Index.
>> The largest column is 229 characters, the others are 8, 12, 30, and 1....
>> ...with an additional column that is an 8 byte integer (i.e. a 'C' long long).
>
> Can you talk a bit more about what the search part of this is?  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on performance, memory, etc.
>
>>
>> We have written a test app on a development system (AIX 6.1),
>> and have successfully Indexed 625 million rows...
>> ...which took about 22 hours.
>
> Was this single threaded or multi-threaded?  How big was the resulting index?
>
>
>>
>> When writing the "search" application... we find a simple version works, however,
>> if we add a Filter or a "sort" to it... we get an "out of memory" exception.
>>
>
> How many terms do you have in your index and in the field you are sorting/filtering on?  Have you tried increasing the heap size?
>
>
>> Before continuing our research, we'd like to find a way to determine
>> what system resources are required to run this kind of application...???
>
> I don't know that there is a straight forward answer here with the information you've presented.  It can depend on how you intend to search/sort/filter/facet, etc.  General rule of thumb is that when you get over 100M documents, you need to shard, but you also have pretty small documents so your mileage may vary.   I've seen indexes in your range on a single machine (for small docs) with low search volumes, but that isn't to say it will work for you without more insight into your documents, etc.
>
>> In other words, how do we calculate the memory needs...???
>>
>> Have others created a similar sized Index to run on a single "shared" server...???
>>
>
> Off the cuff, I think you are pushing the capabilities of doing this on a single machine, especially the one you have spec'd out below.
>
>>
>> Current Environment:
>>
>>       Lucene Version: 3.2
>>       Java Version:   J2RE 6.0 IBM J9 2.4 AIX ppc64-64 build jvmap6460-20090215_29883
>>                        (i.e. 64 bit Java 6)
>>       OS:                     AIX 6.1
>>       Platform:               PPC  (IBM P520)
>>       cores:          2
>>       Memory:         8 GB
>>       jvm memory:     -Xms4072m -Xmx4072m
>>
>> Any guidance would be greatly appreciated.
>>
>> -tony
>
> --------------------------------------------
> Grant Ingersoll
> Lucid Imagination
> http://www.lucidimagination.com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: What kind of System Resources are required to index 625 million row table...???

Posted by "Bennett, Tony" <Be...@con-way.com>.
Thanks for the quick response.

As to your questions:

  Can you talk a bit more about what the search part of this is?  
  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on 
  performance, memory, etc.

We currently have an "exact match" search facility, which uses SQL.
We would like to add "text search" capabilities...
...initially, having the ability to search the 229 character field for a given word, or phrase, instead of an exact match.
A future enhancement would be to add a synonym list.
As to "field choice", yes, it is possible that all fields would be involved in the "search"...
...in the interest of full disclosure, the fields are:
   - corp  - corporation that owns the document
   - type  - document type
   - tmst  - creation timestamp
   - xmlid - xml namespace ID
   - tag   - meta data qualifier
   - data  - actual metadata  (example:  carton of red 3 ring binders )
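
As a rough sketch of how those six fields could map onto a Lucene 3.x
Document (the store/index flags, norms choices, and variable names are
assumptions to be tuned, not a prescription; writer is an open IndexWriter):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;

    Document doc = new Document();
    doc.add(new Field("corp",  corp,  Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
    doc.add(new Field("type",  type,  Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
    doc.add(new NumericField("tmst", Field.Store.YES, true).setLongValue(tmstMicros));
    doc.add(new Field("xmlid", xmlid, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
    doc.add(new Field("tag",   tag,   Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
    // "data" is the 229-character metadata; the one field tokenized for word/phrase search
    doc.add(new Field("data",  data,  Field.Store.YES, Field.Index.ANALYZED_NO_NORMS));
    writer.addDocument(doc);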



  Was this single threaded or multi-threaded?  How big was the resulting index?

The search would be a threaded application.

  How big was the resulting index?

The index that was built was 70 GB in size.

  Have you tried increasing the heap size?

We have increased the heap up to 4 GB... on an 8 GB machine...
That's why we'd like a methodology for calculating memory requirements
to see if this application is even feasible.

Thanks,
-tony 


-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@apache.org] 
Sent: Monday, August 15, 2011 2:33 PM
To: java-user@lucene.apache.org
Subject: Re: What kind of System Resources are required to index 625 million row table...???


On Aug 15, 2011, at 2:39 PM, Bennett, Tony wrote:

> We are examining the possibility of using Lucene to provide Text Search 
> capabilities for a 625 million row DB2 table.
> 
> The table has 6 fields, all which must be stored in the Lucene Index.  
> The largest column is 229 characters, the others are 8, 12, 30, and 1....
> ...with an additional column that is an 8 byte integer (i.e. a 'C' long long).

Can you talk a bit more about what the search part of this is?  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on performance, memory, etc.

> 
> We have written a test app on a development system (AIX 6.1),
> and have successfully Indexed 625 million rows...
> ...which took about 22 hours.

Was this single threaded or multi-threaded?  How big was the resulting index?


> 
> When writing the "search" application... we find a simple version works, however,
> if we add a Filter or a "sort" to it... we get an "out of memory" exception.
> 

How many terms do you have in your index and in the field you are sorting/filtering on?  Have you tried increasing the heap size?


> Before continuing our research, we'd like to find a way to determine 
> what system resources are required to run this kind of application...???

I don't know that there is a straightforward answer here with the information you've presented.  It can depend on how you intend to search/sort/filter/facet, etc.  A general rule of thumb is that when you get over 100M documents, you need to shard, but you also have pretty small documents, so your mileage may vary.  I've seen indexes in your range on a single machine (for small docs) with low search volumes, but that isn't to say it will work for you without more insight into your documents, etc.

> In other words, how do we calculate the memory needs...???
> 
> Have others created a similar sized Index to run on a single "shared" server...???
> 

Off the cuff, I think you are pushing the capabilities of doing this on a single machine, especially the one you have spec'd out below.

> 
> Current Environment:
> 
> 	Lucene Version:	3.2
> 	Java Version:	J2RE 6.0 IBM J9 2.4 AIX ppc64-64 build jvmap6460-20090215_29883
>                        (i.e. 64 bit Java 6)
> 	OS:			AIX 6.1
> 	Platform:		PPC  (IBM P520)
> 	cores:		2
> 	Memory:		8 GB
> 	jvm memory:	-Xms4072m -Xmx4072m
> 
> Any guidance would be greatly appreciated.
> 
> -tony

--------------------------------------------
Grant Ingersoll
Lucid Imagination
http://www.lucidimagination.com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




Re: What kind of System Resources are required to index 625 million row table...???

Posted by Grant Ingersoll <gs...@apache.org>.
On Aug 15, 2011, at 2:39 PM, Bennett, Tony wrote:

> We are examining the possibility of using Lucene to provide Text Search 
> capabilities for a 625 million row DB2 table.
> 
> The table has 6 fields, all which must be stored in the Lucene Index.  
> The largest column is 229 characters, the others are 8, 12, 30, and 1....
> ...with an additional column that is an 8 byte integer (i.e. a 'C' long long).

Can you talk a bit more about what the search part of this is?  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on performance, memory, etc.

> 
> We have written a test app on a development system (AIX 6.1),
> and have successfully Indexed 625 million rows...
> ...which took about 22 hours.

Was this single threaded or multi-threaded?  How big was the resulting index?


> 
> When writing the "search" application... we find a simple version works, however,
> if we add a Filter or a "sort" to it... we get an "out of memory" exception.
> 

How many terms do you have in your index and in the field you are sorting/filtering on?  Have you tried increasing the heap size?


> Before continuing our research, we'd like to find a way to determine 
> what system resources are required to run this kind of application...???

I don't know that there is a straightforward answer here with the information you've presented.  It can depend on how you intend to search/sort/filter/facet, etc.  A general rule of thumb is that when you get over 100M documents, you need to shard, but you also have pretty small documents, so your mileage may vary.  I've seen indexes in your range on a single machine (for small docs) with low search volumes, but that isn't to say it will work for you without more insight into your documents, etc.
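
For what it's worth, plain Lucene 3.x can still present several smaller
shard indexes as one logical index on a single box (paths below are
placeholders; note this mainly helps manageability, not the per-document
sort-cache memory, and the real win from sharding comes when the shards
live on separate machines):

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    IndexReader[] shards = new IndexReader[] {
        IndexReader.open(FSDirectory.open(new File("/idx/shard0"))),
        IndexReader.open(FSDirectory.open(new File("/idx/shard1")))
    };
    IndexSearcher searcher = new IndexSearcher(new MultiReader(shards));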

> In other words, how do we calculate the memory needs...???
> 
> Have others created a similar sized Index to run on a single "shared" server...???
> 

Off the cuff, I think you are pushing the capabilities of doing this on a single machine, especially the one you have spec'd out below.

> 
> Current Environment:
> 
> 	Lucene Version:	3.2
> 	Java Version:	J2RE 6.0 IBM J9 2.4 AIX ppc64-64 build jvmap6460-20090215_29883
>                        (i.e. 64 bit Java 6)
> 	OS:			AIX 6.1
> 	Platform:		PPC  (IBM P520)
> 	cores:		2
> 	Memory:		8 GB
> 	jvm memory:	-Xms4072m -Xmx4072m
> 
> Any guidance would be greatly appreciated.
> 
> -tony

--------------------------------------------
Grant Ingersoll
Lucid Imagination
http://www.lucidimagination.com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org