Posted to dev@lucenenet.apache.org by Pamela Foxcroft <pa...@gmail.com> on 2006/05/19 16:56:33 UTC

noobie question

I have been developing a C# search solution for an application which has
tens of millions of web pages. Most of these web pages are under 1 KB.

While our initial pilot was very encouraging on our tests of 1,000,000 docs,
when we scaled up to 10 million, searches that had been subsecond began
taking 8-10 seconds.

Where should I focus my efforts to increase search speed? Should I be using
the RAMDirectory? MultiSearcher?

We only have one machine right now which serves indexing and searching.

TIA

Pam

Re: noobie question

Posted by Jeff Rodenburg <je...@gmail.com>.

The Compound file format is the default file format for the index that you
create (at least in v1.4.x).  When creating an index, you can specify
true/false in a constructor that indicates if you wish the index file to be
compacted or not.  Check out
http://lucene.apache.org/java/docs/fileformats.html to understand this
better.

When your index gets to be of significant size, the file format can become
very important.  Using the default compound format, searching will tend to
be faster (assuming all other things equal) but index updates will be
slower; with the loose format, searching may be slower but index updates can
be faster.  There are three other properties that can affect the mix as well:
mergeFactor, minMergeDocs, and maxMergeDocs.  Tweaking these properties in
conjunction with the file format settings grows in importance as your index
size increases.  Check out the thread at
http://www.gossamer-threads.com/lists/lucene/java-user/11999?search_string=minmergedocs;#11999

-- j
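
As a rough illustration of why the merge properties matter for searching as
well as indexing, here is a toy model in Python. It is not Lucene's actual
merge code and it ignores maxMergeDocs. It assumes a segment is flushed every
min_merge_docs documents and that merge_factor equal-sized segments at the
end of the list are merged into one. The function name and the numbers are
made up for the example.

```python
def segment_sizes(num_docs, merge_factor=10, min_merge_docs=10):
    """Toy model of segment growth; returns the size of each segment on disk."""
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    segments = []
    full_flushes, remainder = divmod(num_docs, min_merge_docs)
    for _ in range(full_flushes):
        segments.append(min_merge_docs)
        # Cascade: merge while the last merge_factor segments are the same size.
        while (len(segments) >= merge_factor
               and len(set(segments[-merge_factor:])) == 1):
            merged = sum(segments[-merge_factor:])
            del segments[-merge_factor:]
            segments.append(merged)
    if remainder:
        segments.append(remainder)  # partially filled last segment
    return segments


print(len(segment_sizes(990, merge_factor=10)))  # 18
print(segment_sizes(990, merge_factor=2))        # [640, 320, 20, 10]
```

In this model a lower merge factor leaves fewer segments for a searcher to
visit, at the cost of more merge work while indexing.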



On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
>
> Thanks Jeff, I am a little confused by the compound vs loose file format
> you speak of.
>
> We are indexing html docs and indexing 10 metatags. By indexing I mean we
> index the body, but we also query the properties. I am not sure what the
> correct definition is.
>
> Are you saying that if we were merely indexing the document bodies we
> would be further ahead? We need to restrict our searches by date, and a
> few other properties, so it's really important that we be able to do
> these restrictions.
>
> TIA
>
> Pam
>
>
> On 5/19/06, Jeff Rodenburg <je...@gmail.com> wrote:
> >
> > Hi Pamela -
> >
> > Performance certainly changes as your index grows, and it's not even
> > necessarily a linear progression.  How you indexed your data,
> > compression factors, compound vs. loose file format, number of indexes,
> > etc. all play a part in affecting search performance at runtime.
> >
> > There are a lot of places to look for improvements.  I would suggest
> > looking at your specific indexes and see if you can break those up into
> > smaller indexes -- this will lead you to the MultiSearcher (and, if you
> > have multi-processor hardware, ParallelMultiSearcher).
> >
> > Leave your index updating operation out of the picture for the moment.
> > Indexing can have a big impact on search performance, so take that out
> > of the equation.  After you're able to get to better runtime search
> > performance, go back and add indexing to the mix.  I can tell you from
> > experience that most search systems with indexes of substantial size are
> > executing indexing operations on separate systems to avoid performance
> > impacts.
> >
> > Hope this helps.
> >
> > -- j

Re: noobie question

Posted by Pamela Foxcroft <pa...@gmail.com>.

Thanks Jeff, I am a little confused by the compound vs loose file format you
speak of.

We are indexing html docs and indexing 10 metatags. By indexing I mean we
index the body, but we also query the properties. I am not sure what the
correct definition is.

Are you saying that if we were merely indexing the document bodies we would
be further ahead? We need to restrict our searches by date, and a few other
properties, so it's really important that we be able to do these
restrictions.

TIA

Pam



RE: noobie question

Posted by George Aroush <ge...@aroush.net>.

Ahh, I wasn't thinking of a 64-bit OS.  Speaking of which, have you or has
anyone compiled Lucene.Net, or Java Lucene for that matter, as a 64-bit
application and got it running?

-- George Aroush

-----Original Message-----
From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com] 
Sent: Monday, May 22, 2006 5:44 PM
To: lucene-net-dev@incubator.apache.org
Subject: Re: noobie question

You could certainly load a 7 GB index into memory, given sufficient hardware
running 64-bit Windows.  That said, I wouldn't suggest trying to carry a
single 7 GB index in a single server's memory.

Keeping an index below a 2 GB threshold is only treating a symptom and isn't
really sustainable if your index is already in the 7 GB range.  The issue at
hand is dealing with the indexed data as efficiently as possible.  Following
George's suggestion for stripping the index down, i.e. just using searchable
entities, is one possible approach.  In our situation, we have quite a few
fields of data that would be performance hits elsewhere on our system to
retrieve at search run-time, so the lesser evil is to include them in our
index.  It just depends on your requirements to determine what's best.
Likewise, monitoring your hardware statistics for bottlenecks isn't
invalid, but I doubt you'll be able to make the modifications necessary to
achieve the results you'd like to see on hardware config changes alone.

Based on the conversation we've had thus far and a few assumptions on my
part, I doubt you'll be able to keep your search times anywhere near the
thresholds you'd like to see.  You can help yourself with reduced index
size, tweaked hardware configurations, and indexing strategies, but there is
no silver bullet here.  If my experiences hold true for you, you'll end up
addressing each of these areas as you look for efficiencies of scale.

-- j

On 5/22/06, George Aroush <ge...@aroush.net> wrote:
>
> Hi Pam and Jeff,
>
> You can't load 7 GB of index into memory.  A typical Windows
> application can't access more than 2 GB of RAM -- so if a machine has
> 8 GB and only Lucene is running, chances are that you still have a lot of
> real memory not being used.
>
> You need to investigate and find out why your index grew to 7 GB and
> reduce its size.  For example, are you storing any data in Lucene's
> index?  If so, consider not doing so.
>
> Monitor your CPU and see whether it is being maxed out or not.  Chances
> are that it is and if queries are still taking long to run then your
> focus should be on disk I/O.
>
> Regards,
>
> -- George Aroush
>
>
> -----Original Message-----
> From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> Sent: Saturday, May 20, 2006 11:18 AM
> To: lucene-net-dev@incubator.apache.org
> Subject: Re: noobie question
>
> - Our index is currently 7 Gigs. I take it we should have more than 7
> Gigs of RAM on our machine? Can we get any other hardware specs? i.e. 2,
> 4 procs?
>
> You can go with big RAM, but I haven't found that to be a huge boost
> in search perf.  We run dual-proc Xeons for our search servers, as CPU
> has been the bottleneck.  Sorts are particularly egregious when it
> comes to CPU load as well.  Bang for the buck, the new dual-core
> Opterons are *amazingly* strong performers.
>
> - Each html doc we have has 10 metatags which we store. Other than
> date, and a 10 byte string for one of the metatags, the metatags are
> almost always empty. Will this degrade performance?
>
> I would not expect this to degrade your performance.
>
> - Also when you suggest we distribute our index, on what criteria do
> we partition? It looks like we need to optimize our IO for reads which
> means raid 5 or a solid state ram drive to me. Is this correct? Could
> we perhaps cache it in ram (file system cache) by issuing warm up queries?
>
> The faster your disk, the better.  And yes, warm-up queries are a big
> help.  In our instance, warm-up queries need to be logically distributed
> to hit all the searchers.
>
>
> On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
> >
> > Hi George
> >
> > Our index is currently 7 Gigs. I take it we should have more than 7
> > Gigs of RAM on our machine? Can we get any other hardware specs? i.e.
> > 2, 4 procs?
> >
> > Each html doc we have has 10 metatags which we store. Other than
> > date, and a 10 byte string for one of the metatags, the metatags are
> > almost always empty. Will this degrade performance?
> >
> > Also when you suggest we distribute our index, on what criteria do
> > we partition? It looks like we need to optimize our IO for reads
> > which means raid 5 or a solid state ram drive to me. Is this
> > correct? Could we perhaps cache it in ram (file system cache) by
> > issuing warm up queries?
> >
> > BTW - we will be running on the Wintel platform using C#.
> >
> > TIA
> >
> > Pam
> >
> >
> > On 5/19/06, George Aroush <ge...@aroush.net> wrote:
> > >
> > > Hi Pam,
> > >
> > > You also need to investigate your hardware configuration.  Besides
> > > the usual of having a fast CPU and maxing out your memory, make sure
> > > you have a fast hard drive.
> > >
> > > As a Lucene index grows, anything you do with Lucene becomes I/O
> > > bound, thus a fast hard drive is critical.  Simply moving from
> > > 5400rpm to 7200rpm will give you a noticeable difference -- switch to
> > > a fast SCSI/RAID hard drive and you will see even better results.
> > > Better yet, distribute your index across multiple hard
> > > drives/portions.
> > >
> > > One other thing to look for: are you storing any data in your
> > > Lucene index?  If so, consider not doing it.  The goal is to keep the
> > > index size as small as possible to reduce I/O.
> > >
> > > Good luck.
> > >
> > > -- George Aroush
> > >
> > > -----Original Message-----
> > > From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> > > Sent: Friday, May 19, 2006 4:28 PM
> > > To: lucene-net-dev@incubator.apache.org
> > > Subject: Re: noobie question
> > >
> > > Yes, the merge parameters do affect indexing performance, but
> > > compactness also affects search performance as your index gets
> > > larger.  As you incrementally update the index, the fragmentation
> > > effect (which the merge properties will dictate) causes performance
> > > degradation at search time.
> > >
> > > As for index size, I don't know about any hard and fast rules.  We
> > > have about 7-8GB of indexes of varying structure, and those are
> > > spread out over about 40 indexes.  We try to keep individual indexes
> > > below 300MB, as the operational hassles after that size seem to be
> > > more burdensome.  We also use distributed searching so our indexes
> > > are allocated across multiple machines (no duplication).  As a rule,
> > > we also try to stay below 2.5GB of aggregate indexes on one machine.
> > > Our indexes are a full corpus and we must search across all indexes
> > > all the time.  You can structure your indexes more effectively if you
> > > don't need to search the full corpus all the time.
> > >
> > > With multiple indexes being searched collectively, you'll soon be
> > > using the MultiSearcher class.  Be sure to look at MultiReader, as
> > > it makes a difference in search performance (nice caching).
> > >
> > > -- j
> > >
> > > On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
> > > >
> > > > Hi Jeff
> > > >
> > > > A couple more questions. Don't the merge parameters determine
> > > > how aggressively the index is compacted? And if so, doesn't this
> > > > affect only indexing performance and not search performance?
> > > >
> > > > Secondly, how large should each index be? Should I be
> > > > partitioning the indexes, i.e. by date range? So one index for
> > > > December 2005, one for January, etc? Or is it done by size?
> > > >
> > > > TIA
> > > >
> > > > Pam
> > > >

Re: noobie question

Posted by Jeff Rodenburg <je...@gmail.com>.

You could certainly load a 7 GB index into memory, given sufficient hardware
running 64-bit Windows.  That said, I wouldn't suggest trying to carry a
single 7 GB index in a single server's memory.

Keeping an index below a 2 GB threshold is only treating a symptom and isn't
really sustainable if your index is already in the 7 GB range.  The issue at
hand is dealing with the indexed data as efficiently as possible.  Following
George's suggestion for stripping the index down, i.e. just using searchable
entities, is one possible approach.  In our situation, we have quite a few
fields of data that would be performance hits elsewhere on our system to
retrieve at search run-time, so the lesser evil is to include them in our
index.  It just depends on your requirements to determine what's best.
Likewise, monitoring your hardware statistics for bottlenecks isn't
invalid, but I doubt you'll be able to make the modifications necessary to
achieve the results you'd like to see on hardware config changes alone.

Based on the conversation we've had thus far and a few assumptions on my
part, I doubt you'll be able to keep your search times anywhere near the
thresholds you'd like to see.  You can help yourself with reduced index
size, tweaked hardware configurations, and indexing strategies, but there is
no silver bullet here.  If my experiences hold true for you, you'll end up
addressing each of these areas as you look for efficiencies of scale.

-- j
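
The 2 GB versus 64-bit point is plain address-space arithmetic, worked here
in Python. The sketch assumes the default 32-bit Windows split (4 GiB of
virtual address space, 2 GiB of it reserved for the kernel). The function
names are illustrative, and the reservation passed for the 64-bit case is
only a placeholder, since physical RAM rather than address space is the
limit there.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def user_address_space(pointer_bits, kernel_reserved_bytes):
    """Virtual address space left to a process after the OS reservation."""
    return 2 ** pointer_bits - kernel_reserved_bytes


def index_fits(index_bytes, pointer_bits, kernel_reserved_bytes):
    """True if an index of this size could be mapped by a single process."""
    return index_bytes <= user_address_space(pointer_bits, kernel_reserved_bytes)


# 32-bit process: 4 GiB total minus 2 GiB for the kernel leaves 2 GiB.
print(index_fits(7 * GIB, 32, 2 * GIB))  # False
# 64-bit process: the address space is no longer the constraint.
print(index_fits(7 * GIB, 64, 2 * GIB))  # True
```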


RE: noobie question

Posted by George Aroush <ge...@aroush.net>.

For my solution, the only thing I store in the Lucene index is the primary
key.  This kind of solution allows me to keep the Lucene index as small as
possible, which means searching and updating the index is fast.

Anything which is post search -- extracting hit snippets, highlighting, etc
-- is done by another process which I can easily host on another server.

If you design your system along those lines, you can provide a scalable
solution.  Also, I would suggest that you design your solution for fast
searching first, and take care of indexing, highlighting, etc later.

-- George Aroush
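
A minimal sketch of that split, in Python rather than Lucene.Net: the search
structure keeps only tokens and primary keys, and snippets are cut from a
separate content store after the search returns. The dictionary-based index
and store are stand-ins invented for the example, not how Lucene or the
system described above is implemented.

```python
def build_index(docs):
    """Map each lowercase token to the set of primary keys containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index


def search_ids(index, term):
    """Search step: returns primary keys only, no document content."""
    return sorted(index.get(term.lower(), set()))


def render_hits(ids, content_store, term, width=30):
    """Post-search step: fetch text by key and cut a snippet around the term."""
    snippets = {}
    for doc_id in ids:
        text = content_store[doc_id]
        pos = max(text.lower().find(term.lower()), 0)
        start = max(pos - width, 0)
        snippets[doc_id] = text[start:pos + len(term) + width]
    return snippets


store = {1: "Lucene is a search library", 2: "Porting Lucene to C#"}
hits = search_ids(build_index(store), "lucene")
print(hits)                                          # [1, 2]
print(render_hits(hits, store, "lucene", width=5))
```

Because render_hits only needs the keys and the content store, it can run in
a different process or on a different server from the search itself.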

-----Original Message-----
From: Pamela Foxcroft [mailto:pamelafoxcroft@gmail.com] 
Sent: Wednesday, May 24, 2006 12:11 PM
To: lucene-net-dev@incubator.apache.org
Subject: Re: noobie question

Hi Jeff & George

OK, I guess we are storing a lot of data in our index. Basically we are
storing 10 metatags and their values. The only ones which are always
populated are our primary key value and our date value (we are indexing a
database). The rest are almost always empty.

Pam


On 5/23/06, Jeff Rodenburg <je...@gmail.com> wrote:
>
> Hi Pam -
>
> > I am confused, what do you mean by storing data in my index?
> (George, correct me if I'm wrong here.)
>
> What George is referring to is the different manners in which data can
> be included in an index.  Take a look at the Field class and you'll
> notice a series of static methods that store data in a number of ways.
> The static methods define four different ways to include data in an
> index -- Keyword, UnIndexed, UnStored, and Text.  These are just
> wrapper definitions for indexing, storing and tokenizing index
> information.
>
> "Indexing" means including data with a field that would be searchable.
> "Storing" means including data with a field for presentation.
> "Tokenizing" means using analyzed data with a field that's been
> designated as indexed (searchable).
>
> For the four static methods:
> Keyword - values are indexed (searchable) and stored but not tokenized
> UnIndexed - values are stored but not indexed or tokenized
> UnStored - values are indexed and tokenized (searchable) but not stored
> Text - values are indexed, tokenized and stored
>
> In making decisions about index composition, choose the field storage
> method that best matches the need for your particular data field.  The
> fewer data fields you need, the smaller the index, the better the
> performance.
>
>
> > Thanks to you and Jeff for all of your help! I really appreciate it!
> That's why the list is here.  :-)
>
> -- j
>
>
>
> On 5/23/06, Pamela Foxcroft <pa...@gmail.com> wrote:
> >
> > Hi George
> >
> > I am confused, what do you mean by storing data in my index?
> >
> > Thanks to you and Jeff for all of your help! I really appreciate it!
> >
> > Pam
> >
> >
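
The four field kinds Jeff lists can be written down as a small lookup table.
The Python below is a sketch of that table only: the flags are taken from his
description, while the dictionary and the helper function are hypothetical
and are not part of the Lucene or Lucene.Net API.

```python
# Flags per field kind, as described in the message above.
FIELD_TYPES = {
    "Keyword":   {"indexed": True,  "tokenized": False, "stored": True},
    "UnIndexed": {"indexed": False, "tokenized": False, "stored": True},
    "UnStored":  {"indexed": True,  "tokenized": True,  "stored": False},
    "Text":      {"indexed": True,  "tokenized": True,  "stored": True},
}


def choose_field_type(indexed, tokenized, stored):
    """Return the field kind whose flags match the requested combination."""
    wanted = {"indexed": indexed, "tokenized": tokenized, "stored": stored}
    for name, flags in FIELD_TYPES.items():
        if flags == wanted:
            return name
    raise ValueError("none of the four field kinds matches %r" % wanted)


# A primary key: searchable as one exact value and returned with each hit.
print(choose_field_type(indexed=True, tokenized=False, stored=True))   # Keyword
# A page body: searchable by its words, but not kept in the index.
print(choose_field_type(indexed=True, tokenized=True, stored=False))   # UnStored
```

Under that reading, a page body indexed as UnStored adds nothing to the
stored portion of the index, which is the size reduction George is after.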
> > On 5/22/06, George Aroush <ge...@aroush.net> wrote:
> > >
> > > Hi Pam and Jeff,
> > >
> > > You can't load 7Gb of index into memory.  A typical Windows
> application
> > > can't access more then 2Gb of RAM -- so if a machine has 8Gg and 
> > > only Lucene is running chance are that you still have a lot of 
> > > real memory not
> being
> > > used.
> > >
> > > You need to investigate and find out why your index grew to 7Gb 
> > > and
> > reduce
> > > it's size.  For example, are you storing any data in Lucene's
> index?  If
> > > so,
> > > consider not doing so.
> > >
> > > Monitor your CPU and see that it is being max'ed out or not.  
> > > Chance
> are
> > > that it is and if queries are still taking log to run then your 
> > > focus should be on disk I/O.
> > >
> > > Regards,
> > >
> > > -- George Aroush

Re: noobie question

Posted by Pamela Foxcroft <pa...@gmail.com>.

Hi Jeff & George

OK, I guess we are storing a lot of data in our index. Basically we are
storing 10 metatags and their values. The only ones which are always
populated are our primary key value and our date value (we are indexing a
database). The rest are almost always empty.

Pam


On 5/23/06, Jeff Rodenburg <je...@gmail.com> wrote:
>
> Hi Pam -
>
> > I am confused, what do you mean by storing data in my index?
> (George, correct me if I'm wrong here.)
>
> What George is referring to is the different manners in which data can be
> included in an index.  Take a look at the Field class and you'll notice a
> series of static methods that store data in a number of ways.  The static
> methods define four different ways to include data in an index -- Keyword,
> Unindexed, Unstored, and Text.  These are just wrapper definitions for
> indexing, storing and tokenizing index information.
>
> "Indexing" means including data with a field that would be searchable.
> "Storing" means including data with a field for presentation.
> "Tokenizing" means using analyzed data with a field that's been designated
> as indexed (searchable).
>
> For the four static methods:
> Keyword - values are indexed (searchable) and stored but not tokenized
> Unindexed - values are stored but not indexed or tokenized
> Unstored - values are indexed and tokenized (searchable) but not stored
> Text - values are indexed, tokenized and stored
>
> In making decisions about index composition, choose the field storage
> method that best matches the need for your particular data field.  The
> fewer data fields you need, the smaller the index, the better the
> performance.
>
>
> > Thanks to you and Jeff for all of your help! I really appreciate it!
> That's why the list is here.  :-)
>
> -- j

Re: noobie question

Posted by Jeff Rodenburg <je...@gmail.com>.

Hi Pam -

> I am confused, what do you mean by storing data in my index?
(George, correct me if I'm wrong here.)

What George is referring to is the different manners in which data can be
included in an index.  Take a look at the Field class and you'll notice a
series of static methods that store data in a number of ways.  The static
methods define four different ways to include data in an index -- Keyword,
Unindexed, Unstored, and Text.  These are just wrapper definitions for
indexing, storing and tokenizing index information.

"Indexing" means including data with a field that would be searchable.
"Storing" means including data with a field for presentation.
"Tokenizing" means using analyzed data with a field that's been designated
as indexed (searchable).

For the four static methods:
Keyword - values are indexed (searchable) and stored but not tokenized
Unindexed - values are stored but not indexed or tokenized
Unstored - values are indexed and tokenized (searchable) but not stored
Text - values are indexed, tokenized and stored

In making decisions about index composition, choose the field storage method
that best matches the need for your particular data field.  The fewer data
fields you need, the smaller the index, the better the performance.
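
The mapping above can be sketched in a few lines of C#. This is only an
illustrative model, assuming nothing beyond the four helper names and the
indexed/tokenized/stored meanings given in this message. The FieldMode and
FieldModes types and the Choose method are hypothetical names and are not
part of Lucene.Net. The sketch shows how the need of each data field
(searchable or not, analyzed or not, shown in results or not) narrows the
choice to one helper.

```csharp
using System;

// Illustrative model only: FieldMode and FieldModes are hypothetical
// names and are NOT part of Lucene.Net.
public sealed class FieldMode
{
    public readonly string Name;
    public readonly bool Indexed, Tokenized, Stored;

    public FieldMode(string name, bool indexed, bool tokenized, bool stored)
    {
        Name = name; Indexed = indexed; Tokenized = tokenized; Stored = stored;
    }
}

public static class FieldModes
{
    // The four helpers, with the flags as described in the message above.
    public static readonly FieldMode[] All =
    {
        new FieldMode("Keyword",   true,  false, true),
        new FieldMode("Unindexed", false, false, true),
        new FieldMode("Unstored",  true,  true,  false),
        new FieldMode("Text",      true,  true,  true),
    };

    // Returns the helper name matching what the application needs from a
    // field, or throws if none of the four helpers fits.
    public static string Choose(bool searchable, bool analyzed, bool shownInResults)
    {
        // Tokenizing only applies to fields that are indexed.
        bool tokenized = searchable && analyzed;
        foreach (FieldMode m in All)
        {
            if (m.Indexed == searchable && m.Tokenized == tokenized && m.Stored == shownInResults)
                return m.Name;
        }
        throw new ArgumentException("None of the four helpers matches this combination.");
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(FieldModes.Choose(true, false, true));   // primary key: prints "Keyword"
        Console.WriteLine(FieldModes.Choose(true, true, false));   // page body: prints "Unstored"
        Console.WriteLine(FieldModes.Choose(false, false, true));  // display-only value: prints "Unindexed"
    }
}
```

For the case discussed in this thread, a primary key that must be matched
exactly and returned with each hit comes out as Keyword, while page body
text that only needs to be searchable comes out as Unstored, which keeps
the body text out of the stored part of the index.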


> Thanks to you and Jeff for all of your help! I really appreciate it!
That's why the list is here.  :-)

-- j



On 5/23/06, Pamela Foxcroft <pa...@gmail.com> wrote:
>
> Hi George
>
> I am confused, what do you mean by storing data in my index?
>
> Thanks to you and Jeff for all of your help! I really appreciate it!
>
> Pam

Re: noobie question

Posted by Pamela Foxcroft <pa...@gmail.com>.

Hi George

I am confused, what do you mean by storing data in my index?

Thanks to you and Jeff for all of your help! I really appreciate it!

Pam


On 5/22/06, George Aroush <ge...@aroush.net> wrote:
>
> Hi Pam and Jeff,
>
> You can't load 7 GB of index into memory.  A typical (32-bit) Windows
> application can't access more than 2 GB of RAM -- so if a machine has
> 8 GB and only Lucene is running, chances are that you still have a lot
> of real memory not being used.
>
> You need to investigate and find out why your index grew to 7 GB and
> reduce its size.  For example, are you storing any data in Lucene's
> index?  If so, consider not doing so.
>
> Monitor your CPU and see whether it is being maxed out or not.  Chances
> are that it is, and if queries are still taking long to run then your
> focus should be on disk I/O.
>
> Regards,
>
> -- George Aroush
>
>
> -----Original Message-----
> From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> Sent: Saturday, May 20, 2006 11:18 AM
> To: lucene-net-dev@incubator.apache.org
> Subject: Re: noobie question
>
> - Our index is currently 7 Gigs. I take it we should have more than 7 Gigs
> of RAM on our machine? Can we get any other hardware specs? I.e. 2 or 4
> procs?
>
> You can go with big RAM, but I haven't found that to be a huge boost in
> search perf.  We run dual-proc Xeons for our search servers, as CPU has
> been the bottleneck.  Sorts are particularly egregious when it comes to
> CPU load as well.  Bang for the buck, the new dual-core Opterons are
> *amazingly* strong performers.
>
> - Each html doc we have has 10 metatags which we store. Other than date,
> and a 10 byte string for one of the metatags, the metatags are almost
> always empty. Will this degrade performance?
>
> I would not expect this to degrade your performance.
>
> - Also when you suggest we distribute our index, on what criteria do we
> partition? It looks like we need to optimize our IO for reads which means
> raid 5 or a solid state ram drive to me. Is this correct? Could we perhaps
> cache it in ram (file system cache) by issuing warm up queries?
>
> The faster your disk, the better.  And yes, warm-up queries are a big
> help.  In our instance, warm-up queries need to be logically distributed
> to hit all the searchers.
>
>
> On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
> >
> > Hi George
> >
> > Our index is currently 7 Gigs. I take it we should have more than 7
> > Gigs of RAM on our machine? Can we get any other hardware specs? I.e.
> > 2 or 4 procs?
> >
> > Each html doc we have has 10 metatags which we store. Other than date,
> > and a 10 byte string for one of the metatags, the metatags are almost
> > always empty. Will this degrade performance?
> >
> > Also when you suggest we distribute our index, on what criteria do we
> > partition? It looks like we need to optimize our IO for reads which
> > means raid 5 or a solid state ram drive to me. Is this correct? Could
> > we perhaps cache it in ram (file system cache) by issuing warm-up
> > queries?
> >
> > BTW - we will be running on the wintel platform using c#.
> >
> > TIA
> >
> > Pam
> >
> >
> > On 5/19/06, George Aroush <ge...@aroush.net> wrote:
> > >
> > > Hi Pam,
> > >
> > > You also need to investigate your hardware configuration.  Beside
> > > the usual of having a fast CPU and max out your memory, make sure
> > > have a fast hard drive.
> > >
> > > As a Lucene index grows, anything you do with Lucene becomes I/O
> > > bound, thus a fast hard drive is critical.  Simply moving from
> > > 5400rpm to 7200rpm
> > will
> > > give you a noticeable difference -- switch to a fast SCSI/RAID hard
> > > rive and you will even see better results.  And yet even better, if
> > > you
> > distribute
> > > your index across multiple hard-drives/portions.
> > >
> > > One other thing to look for, are you storing any data in your Lucene
> > > index?
> > > If so, consider not doing it.  The goal is to keep the index size as
> > small
> > > as possible to reduce I/O.
> > >
> > > Good luck.
> > >
> > > -- George Aroush
> > >
> > > -----Original Message-----
> > > From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> > > Sent: Friday, May 19, 2006 4:28 PM
> > > To: lucene-net-dev@incubator.apache.org
> > > Subject: Re: noobie question
> > >
> > > Yes, the merge parameters does affect indexing performance, but
> > > compactness also affects search performance as your index gets
> > > larger.  As you incrementally update the index, the fragmentation
> > > effect (which the
> > merge
> > > properties will dictate) causes performance degradation at search
> time.
> > >
> > > As for index size, I don't know about any hard and fast rules.  We
> > > have about 7-8GB of indexes of varying structure, and those are
> > > spread out
> > over
> > > about 40 indexes.  We try to keep individual indexes below 300MB, as
> > > the operational hassles after that size seem to be more burdensome.
> > > We also use distributed searching so our indexes are allocated
> > > across multiple machines (no duplication).  As a rule, we also try
> > > to stay below 2.5GB of
> > aggregate
> > > indexes on one machine.  Our indexes are a full corpus and we must
> > search
> > > across all indexes all the time.  You can structure your indexes
> > > more effectively if you don't need to search the full corpus all the
> time.
> > >
> > > With multiple indexes being searched collectively, you'll soon be
> > > using the MultiSearcher class.  Be sure to look at MultiReader, as
> > > it makes a difference in search performance (nice caching).
> > >
> > > -- j
> > >
> > > On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
> > > >
> > > > Hi Jeff
> > > >
> > > > A couple more questions. Don't the merge parameters determine how
> > > > aggressively the index is compacted? And if so, doesn't this
> > > > affect only indexing performance and not search performance?
> > > >
> > > > Secondly how large should each index be? Should I be partitioning
> > > > the indexes, ie by date range? So one index for Decemeber 2005,
> > > > one for January, etc? Or is it done by size?
> > > >
> > > > TIA
> > > >
> > > > Pam
> > > >
> > > > On 5/19/06, Jeff Rodenburg <je...@gmail.com> wrote:
> > > > >
> > > > > Hi Pamela -
> > > > >
> > > > > Performance certainly changes as your index grows, and it's not
> > > > > even necessarily a linear progression.  How you indexed your
> > > > > data,
> > > > compression
> > > > > factors, compound vs. loose file format, number of indexes, etc.
> > > > > all
> > > > play
> > > > > a
> > > > > part in affecting search performance at runtime.
> > > > >
> > > > > There are a lot of places to look for improvements.  I would
> > > > > suggest looking at your specific indexes and see if you can
> > > > > break those up into smaller indexes -- this will lead you to the
> > > > > MultiSearcher (and, if you have multi-processor hardware,
> ParallelMultiSearcher).
> > > > >
> > > > > Leave your index updating operation out of the picture for the
> > moment.
> > > > > Indexing can have a big impact on search performance, so take
> > > > > that out
> > > > of
> > > > > the equation.  After you're able to get to better runtime search
> > > > > performance, go back and add indexing to the mix.  I can tell
> > > > > you from experience that most search systems with indexes of
> > > > > substantial size are executing indexing operations on separate
> > > > > systems to avoid performance impacts.
> > > > >
> > > > > Hope this helps.
> > > > >
> > > > > -- j
> > > > >
> > > > >
> > > > >
> > > > > On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
> > > > > >
> > > > > > I have been developing a C# search solution for an application
> > > > > > which
> > > > has
> > > > > > tens of millions of web pages. Most of these web pages are
> > > > > > under 1
> > > k.
> > > > > >
> > > > > > While our initial pilot was very encouraging on our tests of
> > > > > > 1,000,000 docs, when we scaled up to 10 million subsecond
> > > > > > searches are now taking 8-10 seconds.
> > > > > >
> > > > > > Where should I focus my efforts to increase search speed?
> > > > > > Should I be using the RAMDirectory? MultiSearcher?
> > > > > >
> > > > > > We only have one machine right now which serves indexing and
> > > > searching.
> > > > > >
> > > > > > TIA
> > > > > >
> > > > > > Pam
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>

RE: noobie question

Posted by George Aroush <ge...@aroush.net>.

Hi Pam and Jeff,

You can't load 7 GB of index into memory.  A typical 32-bit Windows application
can't access more than 2 GB of RAM -- so if a machine has 8 GB and only Lucene
is running, chances are that you still have a lot of real memory not being
used.

You need to investigate and find out why your index grew to 7 GB and reduce
its size.  For example, are you storing any data in Lucene's index?  If so,
consider not doing so.

Monitor your CPU and see whether it is being maxed out or not.  Chances are
that it is, and if queries are still taking long to run then your focus should
be on disk I/O.

Regards,

-- George Aroush


-----Original Message-----
From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com] 
Sent: Saturday, May 20, 2006 11:18 AM
To: lucene-net-dev@incubator.apache.org
Subject: Re: noobie question

- Our index is currently 7 Gigs. I take it we should have more than 7 Gigs
or RAM on our machine? Can we get any other hardware specs? IE 2, 4 procs?

You can go with big RAM, but I haven't found that to be a huge boost in
search perf.  We run dual-proc Xeons for our search servers, as CPU has been
the bottleneck.  Sorts are particularly egregious when it comes to CPU load
as well.  Bang for the buck, running the new dual-core Opterons are
*amazingly* strong performers.

- Each html doc we have has 10 metatags which we store. Other than date, and
a 10 byte string for one of the metatags, the metatags are almost always
empty. Will this degrade performance?

I would not expect this to degrade your performance.

- Also when you suggest we distribute our index, on what criteria do we
partition? It looks like we need to optimize our IO for reads which means
raid 5 or a solid state ram drive to me. Is this correct? Could we perhaps
cache it in ram (file system cache) by issuing warm up queries?

The faster your disk, the better.  And yes, warm-up queries are a big help.
In our instance, warm up queries need to be logically distributed to hit all
the searchers.


On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
>
> Hi George
>
> Our index is currently 7 Gigs. I take it we should have more than 7 
> Gigs or RAM on our machine? Can we get any other hardware specs? IE 2, 
> 4 procs?
>
> Each html doc we have has 10 metatags which we store. Other than date, 
> and a 10 byte string for one of the metatags, the metatags are almost 
> always empty. Will this degrade performance?
>
> Also when you suggest we distribute our index, on what criteria do we 
> partition? It looks like we need to optimize our IO for reads which 
> means raid 5 or a solid state ram drive to me. Is this correct? Could 
> we perhaps cache it in ram (file system cache) by issuing warm up queries?
>
> BTW - we will be running on the wintel platform using c#.
>
> TIA
>
> Pam
>
>
> On 5/19/06, George Aroush <ge...@aroush.net> wrote:
> >
> > Hi Pam,
> >
> > You also need to investigate your hardware configuration.  Beside 
> > the usual of having a fast CPU and max out your memory, make sure 
> > have a fast hard drive.
> >
> > As a Lucene index grows, anything you do with Lucene becomes I/O 
> > bound, thus a fast hard drive is critical.  Simply moving from 
> > 5400rpm to 7200rpm
> will
> > give you a noticeable difference -- switch to a fast SCSI/RAID hard 
> > rive and you will even see better results.  And yet even better, if 
> > you
> distribute
> > your index across multiple hard-drives/portions.
> >
> > One other thing to look for, are you storing any data in your Lucene 
> > index?
> > If so, consider not doing it.  The goal is to keep the index size as
> small
> > as possible to reduce I/O.
> >
> > Good luck.
> >
> > -- George Aroush
> >
> > -----Original Message-----
> > From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> > Sent: Friday, May 19, 2006 4:28 PM
> > To: lucene-net-dev@incubator.apache.org
> > Subject: Re: noobie question
> >
> > Yes, the merge parameters does affect indexing performance, but 
> > compactness also affects search performance as your index gets 
> > larger.  As you incrementally update the index, the fragmentation 
> > effect (which the
> merge
> > properties will dictate) causes performance degradation at search time.
> >
> > As for index size, I don't know about any hard and fast rules.  We 
> > have about 7-8GB of indexes of varying structure, and those are 
> > spread out
> over
> > about 40 indexes.  We try to keep individual indexes below 300MB, as 
> > the operational hassles after that size seem to be more burdensome.  
> > We also use distributed searching so our indexes are allocated 
> > across multiple machines (no duplication).  As a rule, we also try 
> > to stay below 2.5GB of
> aggregate
> > indexes on one machine.  Our indexes are a full corpus and we must
> search
> > across all indexes all the time.  You can structure your indexes 
> > more effectively if you don't need to search the full corpus all the
time.
> >
> > With multiple indexes being searched collectively, you'll soon be 
> > using the MultiSearcher class.  Be sure to look at MultiReader, as 
> > it makes a difference in search performance (nice caching).
> >
> > -- j
> >
> > On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
> > >
> > > Hi Jeff
> > >
> > > A couple more questions. Don't the merge parameters determine how 
> > > aggressively the index is compacted? And if so, doesn't this 
> > > affect only indexing performance and not search performance?
> > >
> > > Secondly how large should each index be? Should I be partitioning 
> > > the indexes, ie by date range? So one index for Decemeber 2005, 
> > > one for January, etc? Or is it done by size?
> > >
> > > TIA
> > >
> > > Pam
> > >
> > > On 5/19/06, Jeff Rodenburg <je...@gmail.com> wrote:
> > > >
> > > > Hi Pamela -
> > > >
> > > > Performance certainly changes as your index grows, and it's not 
> > > > even necessarily a linear progression.  How you indexed your 
> > > > data,
> > > compression
> > > > factors, compound vs. loose file format, number of indexes, etc. 
> > > > all
> > > play
> > > > a
> > > > part in affecting search performance at runtime.
> > > >
> > > > There are a lot of places to look for improvements.  I would 
> > > > suggest looking at your specific indexes and see if you can 
> > > > break those up into smaller indexes -- this will lead you to the 
> > > > MultiSearcher (and, if you have multi-processor hardware,
ParallelMultiSearcher).
> > > >
> > > > Leave your index updating operation out of the picture for the
> moment.
> > > > Indexing can have a big impact on search performance, so take 
> > > > that out
> > > of
> > > > the equation.  After you're able to get to better runtime search 
> > > > performance, go back and add indexing to the mix.  I can tell 
> > > > you from experience that most search systems with indexes of 
> > > > substantial size are executing indexing operations on separate 
> > > > systems to avoid performance impacts.
> > > >
> > > > Hope this helps.
> > > >
> > > > -- j
> > > >
> > > >
> > > >
> > > > On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
> > > > >
> > > > > I have been developing a C# search solution for an application 
> > > > > which
> > > has
> > > > > tens of millions of web pages. Most of these web pages are 
> > > > > under 1
> > k.
> > > > >
> > > > > While our initial pilot was very encouraging on our tests of 
> > > > > 1,000,000 docs, when we scaled up to 10 million subsecond 
> > > > > searches are now taking 8-10 seconds.
> > > > >
> > > > > Where should I focus my efforts to increase search speed? 
> > > > > Should I be using the RAMDirectory? MultiSearcher?
> > > > >
> > > > > We only have one machine right now which serves indexing and
> > > searching.
> > > > >
> > > > > TIA
> > > > >
> > > > > Pam
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>

Re: noobie question

Posted by Jeff Rodenburg <je...@gmail.com>.

- Our index is currently 7 Gigs. I take it we should have more than 7 Gigs
or RAM on our machine? Can we get any other hardware specs? IE 2, 4 procs?

You can go with big RAM, but I haven't found that to be a huge boost in
search perf.  We run dual-proc Xeons for our search servers, as CPU has been
the bottleneck.  Sorts are particularly egregious when it comes to CPU load
as well.  Bang for the buck, the new dual-core Opterons are
*amazingly* strong performers.

- Each html doc we have has 10 metatags which we store. Other than date, and
a 10 byte string for one of the metatags, the metatags are almost always
empty. Will this degrade performance?

I would not expect this to degrade your performance.

- Also when you suggest we distribute our index, on what criteria do we
partition? It looks like we need to optimize our IO for reads which means
raid 5 or a solid state ram drive to me. Is this correct? Could we perhaps
cache it in ram (file system cache) by issuing warm up queries?

The faster your disk, the better.  And yes, warm-up queries are a big help.
In our case, warm-up queries need to be logically distributed so that they hit
all the searchers.
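
One way to read "logically distributed" is a simple round-robin assignment, so
that no searcher is left cold.  The sketch below is only an illustration under
that assumption: the searcher names and query strings are hypothetical, and it
makes no Lucene calls; it only builds the warm-up plan.

```python
from itertools import cycle


def assign_warmup_queries(queries, searcher_names):
    """Spread warm-up queries over searchers in round-robin order.

    Every searcher receives at least one query as long as there are at
    least as many queries as searchers.
    """
    plan = {name: [] for name in searcher_names}
    for query, name in zip(queries, cycle(searcher_names)):
        plan[name].append(query)
    return plan


# Hypothetical queries and searcher names, for illustration only.
plan = assign_warmup_queries(
    ["title:lucene", "body:search", "date:2006", "body:index"],
    ["searcher-a", "searcher-b", "searcher-c"],
)
```

Each searcher would then run its own list once after its index is opened, which
also pulls the index files into the file system cache.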


On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
>
> Hi George
>
> Our index is currently 7 Gigs. I take it we should have more than 7 Gigs
> or
> RAM on our machine? Can we get any other hardware specs? IE 2, 4 procs?
>
> Each html doc we have has 10 metatags which we store. Other than date, and
> a
> 10 byte string for one of the metatags, the metatags are almost always
> empty. Will this degrade performance?
>
> Also when you suggest we distribute our index, on what criteria do we
> partition? It looks like we need to optimize our IO for reads which means
> raid 5 or a solid state ram drive to me. Is this correct? Could we perhaps
> cache it in ram (file system cache) by issuing warm up queries?
>
> BTW - we will be running on the wintel platform using c#.
>
> TIA
>
> Pam
>
>
> On 5/19/06, George Aroush <ge...@aroush.net> wrote:
> >
> > Hi Pam,
> >
> > You also need to investigate your hardware configuration.  Beside the
> > usual
> > of having a fast CPU and max out your memory, make sure have a fast hard
> > drive.
> >
> > As a Lucene index grows, anything you do with Lucene becomes I/O bound,
> > thus
> > a fast hard drive is critical.  Simply moving from 5400rpm to 7200rpm
> will
> > give you a noticeable difference -- switch to a fast SCSI/RAID hard rive
> > and
> > you will even see better results.  And yet even better, if you
> distribute
> > your index across multiple hard-drives/portions.
> >
> > One other thing to look for, are you storing any data in your Lucene
> > index?
> > If so, consider not doing it.  The goal is to keep the index size as
> small
> > as possible to reduce I/O.
> >
> > Good luck.
> >
> > -- George Aroush
> >
> > -----Original Message-----
> > From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> > Sent: Friday, May 19, 2006 4:28 PM
> > To: lucene-net-dev@incubator.apache.org
> > Subject: Re: noobie question
> >
> > Yes, the merge parameters does affect indexing performance, but
> > compactness
> > also affects search performance as your index gets larger.  As you
> > incrementally update the index, the fragmentation effect (which the
> merge
> > properties will dictate) causes performance degradation at search time.
> >
> > As for index size, I don't know about any hard and fast rules.  We have
> > about 7-8GB of indexes of varying structure, and those are spread out
> over
> > about 40 indexes.  We try to keep individual indexes below 300MB, as the
> > operational hassles after that size seem to be more burdensome.  We also
> > use
> > distributed searching so our indexes are allocated across multiple
> > machines
> > (no duplication).  As a rule, we also try to stay below 2.5GB of
> aggregate
> > indexes on one machine.  Our indexes are a full corpus and we must
> search
> > across all indexes all the time.  You can structure your indexes more
> > effectively if you don't need to search the full corpus all the time.
> >
> > With multiple indexes being searched collectively, you'll soon be using
> > the
> > MultiSearcher class.  Be sure to look at MultiReader, as it makes a
> > difference in search performance (nice caching).
> >
> > -- j
> >
> > On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
> > >
> > > Hi Jeff
> > >
> > > A couple more questions. Don't the merge parameters determine how
> > > aggressively the index is compacted? And if so, doesn't this affect
> > > only indexing performance and not search performance?
> > >
> > > Secondly how large should each index be? Should I be partitioning the
> > > indexes, ie by date range? So one index for Decemeber 2005, one for
> > > January, etc? Or is it done by size?
> > >
> > > TIA
> > >
> > > Pam
> > >
> > > On 5/19/06, Jeff Rodenburg <je...@gmail.com> wrote:
> > > >
> > > > Hi Pamela -
> > > >
> > > > Performance certainly changes as your index grows, and it's not even
> > > > necessarily a linear progression.  How you indexed your data,
> > > compression
> > > > factors, compound vs. loose file format, number of indexes, etc. all
> > > play
> > > > a
> > > > part in affecting search performance at runtime.
> > > >
> > > > There are a lot of places to look for improvements.  I would suggest
> > > > looking at your specific indexes and see if you can break those up
> > > > into smaller indexes -- this will lead you to the MultiSearcher
> > > > (and, if you have multi-processor hardware, ParallelMultiSearcher).
> > > >
> > > > Leave your index updating operation out of the picture for the
> moment.
> > > > Indexing can have a big impact on search performance, so take that
> > > > out
> > > of
> > > > the equation.  After you're able to get to better runtime search
> > > > performance, go back and add indexing to the mix.  I can tell you
> > > > from experience that most search systems with indexes of substantial
> > > > size are executing indexing operations on separate systems to avoid
> > > > performance impacts.
> > > >
> > > > Hope this helps.
> > > >
> > > > -- j
> > > >
> > > >
> > > >
> > > > On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
> > > > >
> > > > > I have been developing a C# search solution for an application
> > > > > which
> > > has
> > > > > tens of millions of web pages. Most of these web pages are under 1
> > k.
> > > > >
> > > > > While our initial pilot was very encouraging on our tests of
> > > > > 1,000,000 docs, when we scaled up to 10 million subsecond searches
> > > > > are now taking 8-10 seconds.
> > > > >
> > > > > Where should I focus my efforts to increase search speed? Should I
> > > > > be using the RAMDirectory? MultiSearcher?
> > > > >
> > > > > We only have one machine right now which serves indexing and
> > > searching.
> > > > >
> > > > > TIA
> > > > >
> > > > > Pam
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>

Re: noobie question

Posted by Pamela Foxcroft <pa...@gmail.com>.

Hi George

Our index is currently 7 Gigs. I take it we should have more than 7 Gigs of
RAM on our machine? Can we get any other hardware specs? For example, 2 or 4
procs?

Each HTML doc we have has 10 metatags which we store. Other than date, and a
10 byte string for one of the metatags, the metatags are almost always
empty. Will this degrade performance?

Also, when you suggest we distribute our index, on what criteria do we
partition? It looks like we need to optimize our I/O for reads, which to me
means RAID 5 or a solid-state RAM drive. Is this correct? Could we perhaps
cache it in RAM (file system cache) by issuing warm-up queries?

BTW - we will be running on the Wintel platform using C#.

TIA

Pam


On 5/19/06, George Aroush <ge...@aroush.net> wrote:
>
> Hi Pam,
>
> You also need to investigate your hardware configuration.  Beside the
> usual
> of having a fast CPU and max out your memory, make sure have a fast hard
> drive.
>
> As a Lucene index grows, anything you do with Lucene becomes I/O bound,
> thus
> a fast hard drive is critical.  Simply moving from 5400rpm to 7200rpm will
> give you a noticeable difference -- switch to a fast SCSI/RAID hard rive
> and
> you will even see better results.  And yet even better, if you distribute
> your index across multiple hard-drives/portions.
>
> One other thing to look for, are you storing any data in your Lucene
> index?
> If so, consider not doing it.  The goal is to keep the index size as small
> as possible to reduce I/O.
>
> Good luck.
>
> -- George Aroush
>
> -----Original Message-----
> From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com]
> Sent: Friday, May 19, 2006 4:28 PM
> To: lucene-net-dev@incubator.apache.org
> Subject: Re: noobie question
>
> Yes, the merge parameters does affect indexing performance, but
> compactness
> also affects search performance as your index gets larger.  As you
> incrementally update the index, the fragmentation effect (which the merge
> properties will dictate) causes performance degradation at search time.
>
> As for index size, I don't know about any hard and fast rules.  We have
> about 7-8GB of indexes of varying structure, and those are spread out over
> about 40 indexes.  We try to keep individual indexes below 300MB, as the
> operational hassles after that size seem to be more burdensome.  We also
> use
> distributed searching so our indexes are allocated across multiple
> machines
> (no duplication).  As a rule, we also try to stay below 2.5GB of aggregate
> indexes on one machine.  Our indexes are a full corpus and we must search
> across all indexes all the time.  You can structure your indexes more
> effectively if you don't need to search the full corpus all the time.
>
> With multiple indexes being searched collectively, you'll soon be using
> the
> MultiSearcher class.  Be sure to look at MultiReader, as it makes a
> difference in search performance (nice caching).
>
> -- j
>
> On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
> >
> > Hi Jeff
> >
> > A couple more questions. Don't the merge parameters determine how
> > aggressively the index is compacted? And if so, doesn't this affect
> > only indexing performance and not search performance?
> >
> > Secondly how large should each index be? Should I be partitioning the
> > indexes, ie by date range? So one index for Decemeber 2005, one for
> > January, etc? Or is it done by size?
> >
> > TIA
> >
> > Pam
> >
> > On 5/19/06, Jeff Rodenburg <je...@gmail.com> wrote:
> > >
> > > Hi Pamela -
> > >
> > > Performance certainly changes as your index grows, and it's not even
> > > necessarily a linear progression.  How you indexed your data,
> > compression
> > > factors, compound vs. loose file format, number of indexes, etc. all
> > play
> > > a
> > > part in affecting search performance at runtime.
> > >
> > > There are a lot of places to look for improvements.  I would suggest
> > > looking at your specific indexes and see if you can break those up
> > > into smaller indexes -- this will lead you to the MultiSearcher
> > > (and, if you have multi-processor hardware, ParallelMultiSearcher).
> > >
> > > Leave your index updating operation out of the picture for the moment.
> > > Indexing can have a big impact on search performance, so take that
> > > out
> > of
> > > the equation.  After you're able to get to better runtime search
> > > performance, go back and add indexing to the mix.  I can tell you
> > > from experience that most search systems with indexes of substantial
> > > size are executing indexing operations on separate systems to avoid
> > > performance impacts.
> > >
> > > Hope this helps.
> > >
> > > -- j
> > >
> > >
> > >
> > > On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
> > > >
> > > > I have been developing a C# search solution for an application
> > > > which
> > has
> > > > tens of millions of web pages. Most of these web pages are under 1
> k.
> > > >
> > > > While our initial pilot was very encouraging on our tests of
> > > > 1,000,000 docs, when we scaled up to 10 million subsecond searches
> > > > are now taking 8-10 seconds.
> > > >
> > > > Where should I focus my efforts to increase search speed? Should I
> > > > be using the RAMDirectory? MultiSearcher?
> > > >
> > > > We only have one machine right now which serves indexing and
> > searching.
> > > >
> > > > TIA
> > > >
> > > > Pam
> > > >
> > > >
> > >
> > >
> >
> >
>
>

Re: noobie question

Posted by Jeff Rodenburg <je...@gmail.com>.

Correct on our configuration, give or take a few 100 MB.  :-)
And we have three servers accessed simultaneously for each search.

For our index, we're dealing with information that's geographically defined,
so our indexes are broken up along those lines.  We still monitor each index
for size, but the geographic data drives our index maintenance logic.  We've
indexed approximately 20 MM rows of information.

Our partitioning criteria serve two purposes: query efficiency and index
maintainability.  Depending on how your index is structured (the Lucene
settings + your own document structure), these two can compete with each
other to the point of being polar opposites.  Generally you'll want to find a
happy medium between the two.  While we have many rows of data and our index
documents contain quite a few fields of data, many of them are simple data
fields that aren't large (database is the data source).  By contrast, if we
were indexing full-on text documents, I'm sure our index would be
substantially larger and we'd likely take a different approach.
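
As a rough illustration of partitioning along a natural key such as geography,
the sketch below groups rows by a region field before each group is handed to
its own index.  The field names and sample rows are hypothetical, and the
indexing step itself is left out.

```python
from collections import defaultdict


def partition_by_key(rows, key):
    """Group rows into partitions by the value of one field (e.g. a region)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


# Hypothetical rows; in practice these would come from the database.
rows = [
    {"id": 1, "region": "northwest"},
    {"id": 2, "region": "southeast"},
    {"id": 3, "region": "northwest"},
]
partitions = partition_by_key(rows, "region")
```

A query that names a region only has to touch that region's index, while each
partition can still be monitored for size and split further if it grows too
large.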

I did a lot of research prior to constructing our index, and even with as much
feedback and data as I could glean, trial and error proved to be the most
effective way of determining what to do and how to do it.

-- j


On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
>
> OK, I'm very confused here Jeff. It sound like what you are suggesting is
> that you have multiple indexes per machine, each around 300 Mbyes, which
> means about 2.5/.3 = 8 indexes per machine, and you have 7.5/2.5 =3
> machines
> in the mix. Is this correct?
>
> On what criteria do you partition your index? Date, or some other
> criteria,
> or is it merely size?
>
> I think we have indexed 1 million rows and our index is 7 Gigs.
>
> Pam
>
>
> On 5/19/06, Jeff Rodenburg <je...@gmail.com> wrote:
> >
> > Yes, the merge parameters does affect indexing performance, but
> > compactness
> > also affects search performance as your index gets larger.  As you
> > incrementally update the index, the fragmentation effect (which the
> merge
> > properties will dictate) causes performance degradation at search time.
> >
> > As for index size, I don't know about any hard and fast rules.  We have
> > about 7-8GB of indexes of varying structure, and those are spread out
> over
> > about 40 indexes.  We try to keep individual indexes below 300MB, as the
> > operational hassles after that size seem to be more burdensome.  We also
> > use
> > distributed searching so our indexes are allocated across multiple
> > machines
> > (no duplication).  As a rule, we also try to stay below 2.5GB of
> aggregate
> > indexes on one machine.  Our indexes are a full corpus and we must
> search
> > across all indexes all the time.  You can structure your indexes more
> > effectively if you don't need to search the full corpus all the time.
> >
> > With multiple indexes being searched collectively, you'll soon be using
> > the
> > MultiSearcher class.  Be sure to look at MultiReader, as it makes a
> > difference in search performance (nice caching).
> >
> > -- j
> >
> > On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
> > >
> > > Hi Jeff
> > >
> > > A couple more questions. Don't the merge parameters determine how
> > > aggressively the index is compacted? And if so, doesn't this affect
> only
> > > indexing performance and not search performance?
> > >
> > > Secondly how large should each index be? Should I be partitioning the
> > > indexes, ie by date range? So one index for Decemeber 2005, one for
> > > January,
> > > etc? Or is it done by size?
> > >
> > > TIA
> > >
> > > Pam
> > >
> > > On 5/19/06, Jeff Rodenburg <je...@gmail.com> wrote:
> > > >
> > > > Hi Pamela -
> > > >
> > > > Performance certainly changes as your index grows, and it's not even
> > > > necessarily a linear progression.  How you indexed your data,
> > > compression
> > > > factors, compound vs. loose file format, number of indexes, etc. all
> > > play
> > > > a
> > > > part in affecting search performance at runtime.
> > > >
> > > > There are a lot of places to look for improvements.  I would suggest
> > > > looking
> > > > at your specific indexes and see if you can break those up into
> > smaller
> > > > indexes -- this will lead you to the MultiSearcher (and, if you have
> > > > multi-processor hardware, ParallelMultiSearcher).
> > > >
> > > > Leave your index updating operation out of the picture for the
> moment.
> > > > Indexing can have a big impact on search performance, so take that
> out
> > > of
> > > > the equation.  After you're able to get to better runtime search
> > > > performance, go back and add indexing to the mix.  I can tell you
> from
> > > > experience that most search systems with indexes of substantial size
> > are
> > > > executing indexing operations on separate systems to avoid
> performance
> > > > impacts.
> > > >
> > > > Hope this helps.
> > > >
> > > > -- j
> > > >
> > > >
> > > >
> > > > On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
> > > > >
> > > > > I have been developing a C# search solution for an application
> which
> > > has
> > > > > tens of millions of web pages. Most of these web pages are under 1
> > k.
> > > > >
> > > > > While our initial pilot was very encouraging on our tests of
> > 1,000,000
> > > > > docs,
> > > > > when we scaled up to 10 million subsecond searches are now taking
> > 8-10
> > > > > seconds.
> > > > >
> > > > > Where should I focus my efforts to increase search speed? Should I
> > be
> > > > > using
> > > > > the RAMDirectory? MultiSearcher?
> > > > >
> > > > > We only have one machine right now which serves indexing and
> > > searching.
> > > > >
> > > > > TIA
> > > > >
> > > > > Pam
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>

Re: noobie question

Posted by Pamela Foxcroft <pa...@gmail.com>.

OK, I'm very confused here, Jeff. It sounds like what you are suggesting is
that you have multiple indexes per machine, each around 300 MB, which
means about 2.5/0.3 = 8 indexes per machine, and you have 7.5/2.5 = 3 machines
in the mix. Is this correct?
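
The arithmetic above can be checked with a few lines.  The figures are the ones
quoted in the thread (300 MB per index, 2.5 GB per machine, roughly 7.5 GB in
total, about 40 indexes); treating 1 GB as 1000 MB is an assumption made only
to keep the numbers round.

```python
import math

index_size_gb = 0.3        # target ceiling per index
per_machine_cap_gb = 2.5   # aggregate ceiling per machine
total_corpus_gb = 7.5      # midpoint of the quoted 7-8 GB
index_count = 40           # approximate number of indexes quoted

# Whole indexes of maximum size that fit under the per-machine ceiling.
indexes_per_machine = math.floor(per_machine_cap_gb / index_size_gb)

# Machines needed to hold the full corpus without duplication.
machines_needed = math.ceil(total_corpus_gb / per_machine_cap_gb)

# Average index size if the corpus is spread over about 40 indexes.
average_index_mb = total_corpus_gb * 1000 / index_count

print(indexes_per_machine, machines_needed, average_index_mb)
```

Eight full-size indexes on each of three machines would be only 24 indexes, so
with about 40 indexes most of them must sit well below the 300 MB ceiling.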

On what criteria do you partition your index? By date, by some other
criterion, or merely by size?

I think we have indexed 1 million rows and our index is 7 Gigs.

Pam


On 5/19/06, Jeff Rodenburg <je...@gmail.com> wrote:
>
> Yes, the merge parameters do affect indexing performance, but compactness
> also affects search performance as your index gets larger.  As you
> incrementally update the index, the fragmentation effect (which the merge
> properties will dictate) causes performance degradation at search time.
>
> As for index size, I don't know about any hard and fast rules.  We have
> about 7-8GB of indexes of varying structure, and those are spread out over
> about 40 indexes.  We try to keep individual indexes below 300MB, as the
> operational hassles after that size seem to be more burdensome.  We also
> use
> distributed searching so our indexes are allocated across multiple
> machines
> (no duplication).  As a rule, we also try to stay below 2.5GB of aggregate
> indexes on one machine.  Our indexes are a full corpus and we must search
> across all indexes all the time.  You can structure your indexes more
> effectively if you don't need to search the full corpus all the time.
>
> With multiple indexes being searched collectively, you'll soon be using
> the
> MultiSearcher class.  Be sure to look at MultiReader, as it makes a
> difference in search performance (nice caching).
>
> -- j

RE: noobie question

Posted by George Aroush <ge...@aroush.net>.

Hi Pam,

You also need to investigate your hardware configuration.  Besides the usual
advice of having a fast CPU and maxing out your memory, make sure you have a
fast hard drive.

As a Lucene index grows, anything you do with Lucene becomes I/O bound, thus
a fast hard drive is critical.  Simply moving from 5400rpm to 7200rpm will
give you a noticeable difference -- switch to a fast SCSI/RAID hard drive and
you will see even better results.  Better still, distribute your index across
multiple hard drives/portions.
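
As a rough illustration of why spindle speed matters for an I/O-bound
workload, the sketch below computes average rotational latency (the time
for half a revolution) at a few speeds.  This is only one component of disk
access time; seek time and transfer rate are ignored, and the 10000 and
15000 rpm figures are added here for comparison, not taken from this
thread.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency: half a revolution, in milliseconds."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


for rpm in (5400, 7200, 10000, 15000):
    print(f"{rpm:>6} rpm: {avg_rotational_latency_ms(rpm):.2f} ms")
```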

One other thing to look for: are you storing any data in your Lucene index?
If so, consider not doing it.  The goal is to keep the index size as small
as possible to reduce I/O.

Good luck.

-- George Aroush

-----Original Message-----
From: Jeff Rodenburg [mailto:jeff.rodenburg@gmail.com] 
Sent: Friday, May 19, 2006 4:28 PM
To: lucene-net-dev@incubator.apache.org
Subject: Re: noobie question

Yes, the merge parameters do affect indexing performance, but compactness
also affects search performance as your index gets larger.  As you
incrementally update the index, the fragmentation effect (which the merge
properties will dictate) causes performance degradation at search time.

As for index size, I don't know about any hard and fast rules.  We have
about 7-8GB of indexes of varying structure, and those are spread out over
about 40 indexes.  We try to keep individual indexes below 300MB, as the
operational hassles after that size seem to be more burdensome.  We also use
distributed searching so our indexes are allocated across multiple machines
(no duplication).  As a rule, we also try to stay below 2.5GB of aggregate
indexes on one machine.  Our indexes are a full corpus and we must search
across all indexes all the time.  You can structure your indexes more
effectively if you don't need to search the full corpus all the time.

With multiple indexes being searched collectively, you'll soon be using the
MultiSearcher class.  Be sure to look at MultiReader, as it makes a
difference in search performance (nice caching).

-- j


Re: noobie question

Posted by Jeff Rodenburg <je...@gmail.com>.

Yes, the merge parameters do affect indexing performance, but compactness
also affects search performance as your index gets larger.  As you
incrementally update the index, the fragmentation effect (which the merge
properties will dictate) causes performance degradation at search time.
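
To make the fragmentation effect concrete, here is a deliberately
simplified model of logarithmic segment merging.  It is not Lucene's
actual implementation: the function name, the flush-every-N-documents
rule, and the "merge when the newest segments are the same size" rule are
all assumptions chosen to show how a merge factor changes the number of
segments a search has to visit.

```python
def segment_sizes(num_docs, merge_factor=10, min_merge_docs=10):
    """Return the doc count of each segment after adding num_docs documents.

    Simplified model (an assumption, not Lucene code): documents are
    buffered and flushed as a new segment every min_merge_docs documents;
    whenever the newest merge_factor segments have equal size they are
    merged into one larger segment.
    """
    segments = []  # doc count per segment, oldest first
    buffered = 0
    for _ in range(num_docs):
        buffered += 1
        if buffered == min_merge_docs:
            segments.append(buffered)
            buffered = 0
            # Cascade merges while the newest merge_factor segments match.
            while (len(segments) >= merge_factor
                   and len(set(segments[-merge_factor:])) == 1):
                merged = sum(segments[-merge_factor:])
                del segments[-merge_factor:]
                segments.append(merged)
    if buffered:
        segments.append(buffered)  # partially filled trailing segment
    return segments


print(len(segment_sizes(995, merge_factor=10)))  # many small segments
print(len(segment_sizes(995, merge_factor=2)))   # fewer, larger segments
```

In this model a higher merge factor leaves more segments lying around
between merges (less merge work while indexing, more pieces to visit at
search time), while a lower one keeps the index more compact at the cost
of merging more often.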

As for index size, I don't know about any hard and fast rules.  We have
about 7-8GB of indexes of varying structure, and those are spread out over
about 40 indexes.  We try to keep individual indexes below 300MB, as the
operational hassles after that size seem to be more burdensome.  We also use
distributed searching so our indexes are allocated across multiple machines
(no duplication).  As a rule, we also try to stay below 2.5GB of aggregate
indexes on one machine.  Our indexes are a full corpus and we must search
across all indexes all the time.  You can structure your indexes more
effectively if you don't need to search the full corpus all the time.

With multiple indexes being searched collectively, you'll soon be using the
MultiSearcher class.  Be sure to look at MultiReader, as it makes a
difference in search performance (nice caching).
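
Conceptually, searching several indexes collectively means collecting the
best hits from each index and merging them into one ranked list.  The
sketch below shows only that merge step in plain Python; it is not the
Lucene.Net MultiSearcher API, the function name is hypothetical, and it
assumes scores from different indexes are directly comparable.

```python
import heapq
from itertools import islice


def merge_top_hits(per_index_hits, k):
    """Merge per-index hit lists into a single global top-k list.

    Each input list holds (score, doc_id) pairs already sorted by
    descending score, as a per-index search would return them.
    """
    merged = heapq.merge(*per_index_hits, key=lambda hit: -hit[0])
    return list(islice(merged, k))


hits_a = [(0.9, "a1"), (0.4, "a2")]
hits_b = [(0.7, "b1"), (0.6, "b2")]
print(merge_top_hits([hits_a, hits_b], 3))
```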

-- j

On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
>
> Hi Jeff
>
> A couple more questions. Don't the merge parameters determine how
> aggressively the index is compacted? And if so, doesn't this affect only
> indexing performance and not search performance?
>
> Secondly, how large should each index be? Should I be partitioning the
> indexes, i.e. by date range? So one index for December 2005, one for
> January, etc.? Or is it done by size?
>
> TIA
>
> Pam

Re: noobie question

Posted by Pamela Foxcroft <pa...@gmail.com>.

Hi Jeff

A couple more questions. Don't the merge parameters determine how
aggressively the index is compacted? And if so, doesn't this affect only
indexing performance and not search performance?

Secondly, how large should each index be? Should I be partitioning the
indexes, i.e. by date range? So one index for December 2005, one for January,
etc.? Or is it done by size?
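
For illustration, a date-range partitioning scheme like the one asked
about here could be sketched as follows.  The "pages-YYYY-MM" naming
convention and both function names are hypothetical; the point is only
that a document's date selects one index, and a query's date range selects
the subset of indexes that need to be searched.

```python
from datetime import date


def index_name_for(day, prefix="pages"):
    """Name of the monthly index that a document dated `day` belongs to."""
    return f"{prefix}-{day.year:04d}-{day.month:02d}"


def indexes_for_range(start, end, prefix="pages"):
    """Names of every monthly index overlapping the range [start, end]."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


print(index_name_for(date(2005, 12, 14)))
print(indexes_for_range(date(2005, 12, 14), date(2006, 2, 1)))
```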

TIA

Pam

On 5/19/06, Jeff Rodenburg <je...@gmail.com> wrote:
>
> Hi Pamela -
>
> Performance certainly changes as your index grows, and it's not even
> necessarily a linear progression.  How you indexed your data, compression
> factors, compound vs. loose file format, number of indexes, etc. all play
> a
> part in affecting search performance at runtime.
>
> There are a lot of places to look for improvements.  I would suggest
> looking at your specific indexes and seeing whether you can break those up
> into smaller indexes -- this will lead you to the MultiSearcher (and, if
> you have multi-processor hardware, ParallelMultiSearcher).
>
> Leave your index updating operation out of the picture for the moment.
> Indexing can have a big impact on search performance, so take that out of
> the equation.  After you're able to get to better runtime search
> performance, go back and add indexing to the mix.  I can tell you from
> experience that most search systems with indexes of substantial size are
> executing indexing operations on separate systems to avoid performance
> impacts.
>
> Hope this helps.
>
> -- j

Re: noobie question

Posted by Jeff Rodenburg <je...@gmail.com>.

Hi Pamela -

Performance certainly changes as your index grows, and it's not even
necessarily a linear progression.  How you indexed your data, compression
factors, compound vs. loose file format, number of indexes, etc. all play a
part in affecting search performance at runtime.

There are a lot of places to look for improvements.  I would suggest looking
at your specific indexes and seeing whether you can break those up into
smaller indexes -- this will lead you to the MultiSearcher (and, if you have
multi-processor hardware, ParallelMultiSearcher).
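
The general shape of searching several small indexes in parallel can be
sketched in plain Python.  This is not the ParallelMultiSearcher API: each
"shard" below is a toy dictionary standing in for one small index, and the
function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term):
    """Toy stand-in for searching one small index.

    A shard maps a term to a list of (score, doc_id) hits.
    """
    return shard.get(term, [])


def parallel_search(shards, term, k):
    """Query every shard concurrently, then keep the k best hits overall."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda s: search_shard(s, term), shards))
    all_hits = [hit for hits in per_shard for hit in hits]
    return sorted(all_hits, key=lambda hit: hit[0], reverse=True)[:k]


shards = [
    {"lucene": [(0.8, "d1")]},
    {"lucene": [(0.9, "d7"), (0.2, "d9")]},
    {},
]
print(parallel_search(shards, "lucene", 2))
```

With one worker per shard, the slowest shard sets the response time rather
than the sum of all shards, which is where the extra processors help.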

Leave your index updating operation out of the picture for the moment.
Indexing can have a big impact on search performance, so take that out of
the equation.  After you're able to get to better runtime search
performance, go back and add indexing to the mix.  I can tell you from
experience that most search systems with indexes of substantial size are
executing indexing operations on separate systems to avoid performance
impacts.

Hope this helps.

-- j

On 5/19/06, Pamela Foxcroft <pa...@gmail.com> wrote:
>
> I have been developing a C# search solution for an application which has
> tens of millions of web pages. Most of these web pages are under 1 k.
>
> While our initial pilot was very encouraging on our tests of 1,000,000
> docs,
> when we scaled up to 10 million subsecond searches are now taking 8-10
> seconds.
>
> Where should I focus my efforts to increase search speed? Should I be
> using
> the RAMDirectory? MultiSearcher?
>
> We only have one machine right now which serves indexing and searching.
>
> TIA
>
> Pam
>
>