Posted to user@nutch.apache.org by VishalS <vi...@rediff.co.in> on 2009/01/06 14:41:49 UTC

Search performance for large indexes (>100M docs)

Hi,

 

  I am experimenting with a system with around 120 million documents. The
index is split into sub-indices of ~10M documents each, and each such index is
searched by a single machine. The results are aggregated using the
DistributedSearcher client. I am seeing a lot of performance issues with the
system - most of the time, the response times are >4 seconds, and in some
cases they go up to a minute.

 

  It would be wonderful to know if there are ways to optimize what I am
doing, or if there is something obvious that I am doing wrong. Here's what I
have tried so far, and the issues I see:

 

1.	Each search server is a 64-bit Pentium machine with ~7GB RAM and 4
CPUs running Linux. However, the searcher is not able to use more than 1 GB
of RAM even though I have set -Xmx to ~3.5GB. I am guessing this is a Lucene
issue. Is there a way we can have the searcher use more RAM to speed things
up?
2.	The total size of the index directory on each machine is ~70-100 GB.
The prx file is 50GB, the fnm and frq files are ~27GB each, and the fdt file
is around 3GB. Is this too big?
3.	I have tried analyzing my documents for commonly occurring terms in
various fields and added these terms to common-terms.utf8. There are ~10K
terms in this file for me now. I am hoping this will help me speed up the
phrase queries I am doing internally (although there is a cost attached in
terms of the number of unique terms in the Lucene index: the total index
size has increased by ~10-15%, which I guess is OK).
4.	There are around 8 fields that are searched for each of the words
in the query. Also, a phrase query containing all the words is fired in each
of these fields as well. This means that for a 3-word input query, the
number of sub-queries in my Lucene query is 24 (3*8) term queries plus 8 (1*8)
3-word phrase queries (see the sketch after this list). Is this too long or
too expensive?
5.	I have noticed that the slowest-running queries (it takes up to a
minute sometimes) are often the ones that have one or more common
words.
6.	Each individual searcher has a single Lucene indexlet. Would it be
faster to have more than 1 indexlet on the machine?
7.	I am using a Tomcat 6.0 installation out-of-the-box, with some minor
changes to the number of threads and the Java stack size allocation.
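
For reference, here is a rough sketch of what such an expanded query looks like
in plain Lucene (2.x-era APIs; the field names below are hypothetical
placeholders - the actual expansion in my setup happens inside Nutch, not
hand-built like this):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryExpansionSketch {

  // Hypothetical field list standing in for the ~8 fields actually searched.
  private static final String[] FIELDS =
      {"title", "content", "anchor", "url", "host", "site", "meta", "heading"};

  // For a 3-word input this builds 3*8 = 24 term queries plus 1*8 = 8 phrase queries.
  public static Query expand(String[] words) {
    BooleanQuery query = new BooleanQuery();
    for (String field : FIELDS) {
      // One TermQuery per (field, word) pair.
      for (String word : words) {
        query.add(new TermQuery(new Term(field, word)), BooleanClause.Occur.SHOULD);
      }
      // One phrase query over all the words, per field.
      PhraseQuery phrase = new PhraseQuery();
      for (String word : words) {
        phrase.add(new Term(field, word));
      }
      query.add(phrase, BooleanClause.Occur.SHOULD);
    }
    return query;
  }
}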

 

If there's anyone else who has had experience working with large indices, I
would love to get in touch and exchange notes.

 

Regards,

 

-Vishal.


Re: Search performance for large indexes (>100M docs)

Posted by Dennis Kubes <ku...@apache.org>.
Essentially you would create a tmpfs (RAM disk) and put the indexes in 
the tmpfs.  Assuming your indexes were in a folder called indexes.dist, 
you would use something like this:

# create a ~7 GB tmpfs mount (the size= value is in bytes)
mount -t tmpfs -o size=7516192768 none /your/indexes
# copy the on-disk indexes into the RAM-backed mount
rsync --progress -aptv /your/indexes.dist/* /your/indexes/
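
If the index fits in the Java heap, another option is to copy it into a Lucene
RAMDirectory inside the searcher process.  A minimal sketch (assuming Lucene
2.x-era APIs and a hypothetical shard path; the JVM needs an -Xmx larger than
the index):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamIndexSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical path to one on-disk index shard.
    Directory onDisk = FSDirectory.getDirectory("/your/indexes.dist/part-00000");
    // Copies the whole index into the JVM heap.
    Directory inRam = new RAMDirectory(onDisk);
    IndexSearcher searcher = new IndexSearcher(inRam);
    System.out.println("Docs in RAM-resident index: " + searcher.maxDoc());
  }
}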

You will also want to check the mailing list for keeping indexes in RAM.  I 
believe I posted a much more detailed set of instructions before.

Dennis


ianwong wrote:
> Can you tell me how to keep indexes in RAM in the Nutch query server if the
> client side uses DistributedSearcher?
> 
> Thanks
> Ian

Re: Search performance for large indexes (>100M docs)

Posted by ianwong <yi...@hotmail.com>.
Can you tell me how to keep indexes in RAM in the Nutch query server if the
client side uses DistributedSearcher?

Thanks
Ian



Dennis Kubes-2 wrote:
> 
> Take a look at the mailing list archives for keeping the indexes in memory.
> When you get to the sizes you are talking about, the way you get
> sub-second response times is by:
> 
> 1) Keeping the indexes in RAM
> 2) Aggressive caching
> 
> Dennis
> 



Re: Search performance for large indexes (>100M docs)

Posted by Dennis Kubes <ku...@apache.org>.

buddha1021 wrote:
> Hi Dennis:
> In your opinion, which is the most important reason for Google's fast
> search speed:
> 1. Google's code is excellent, or

Yes.  They are performance fanatics (literally).  But there is only so 
much you are going to be able to optimize code, even if it is written in 
assembly.

> 2. Google puts all the indexes into RAM.

Yes.  They would have to.  I don't see another way.  Not saying there 
isn't one, but I haven't found it yet.

> Which is the most important reason?

I think having the indexes in RAM is the most important factor.  But 
having a large caching layer in front of the index is also important. 
Also having a supplemental index is another key factor IMO.

> 
> And if Nutch put all the indexes into RAM, could Nutch's search speed be
> as fast as Google's?

Yes, I definitely think it is possible to get Google speed with Nutch.
Check out www.visvo.com or search.wikia.com.  Both use in-memory
indexes.  It is not something you would just deploy and have out of the
box.  Google is as fast as they are because they have built out every
step efficiently, from the code to the operations to the bandwidth and DNS.

Dennis


Re: Search performance for large indexes (>100M docs)

Posted by buddha1021 <bu...@yahoo.cn>.
Hi Dennis:
In your opinion, which is the most important reason for Google's fast search
speed:
1. Google's code is excellent, or
2. Google puts all the indexes into RAM?
Which is the most important reason?

And if Nutch put all the indexes into RAM, could Nutch's search speed be as
fast as Google's?


Dennis Kubes-2 wrote:
> 
> Take a look at the mailing list archives for keeping the indexes in memory.
> When you get to the sizes you are talking about, the way you get
> sub-second response times is by:
> 
> 1) Keeping the indexes in RAM
> 2) Aggressive caching
> 
> Dennis



Re: Search performance for large indexes (>100M docs)

Posted by Dennis Kubes <ku...@apache.org>.
Take a look at the mailing list archives for keeping the indexes in memory.
When you get to the sizes you are talking about, the way you get
sub-second response times is by:

1) Keeping the indexes in RAM
2) Aggressive caching
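
For (2), a minimal sketch of the idea - an LRU cache keyed on the query string
sitting in front of the searcher (a hypothetical class, not Nutch's own
caching code):

import java.util.LinkedHashMap;
import java.util.Map;

// Tiny LRU cache for search results, keyed on the normalized query string.
public class ResultCache<V> {
  private final Map<String, V> cache;

  public ResultCache(final int maxEntries) {
    // Access-ordered LinkedHashMap that evicts the least recently used entry.
    this.cache = new LinkedHashMap<String, V>(maxEntries, 0.75f, true) {
      protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        return size() > maxEntries;
      }
    };
  }

  public synchronized V get(String query) { return cache.get(query); }

  public synchronized void put(String query, V hits) { cache.put(query, hits); }
}

Usage would be roughly: check the cache before calling the distributed
searcher, and store the top N hits on a miss.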

Dennis


Re: Search performance for large indexes (>100M docs)

Posted by Alex Basa <al...@yahoo.com>.
Yep - just add -d64 to use the 64-bit Java.  I've also had to raise my -XX PermGen setting (e.g. -XX:MaxPermSize) because some really large sites were making it run out of PermGen space.  It also seems that a 3.5GB -Xmx handles 90% of sites; I've had to bump mine up to 5GB for some really huge ones.

Has anyone played with the number-of-threads setting?  I have mine at 10, but it's hardly making a dent on the blade server even running at max.  I was thinking of upping it to 20.



Re: Search performance for large indexes (>100M docs)

Posted by Mark Bennett <mb...@ideaeng.com>.
Hi Dennis,

I don't follow this group in real time, so I know this is a late reply, and
if you reply to me please CC me directly.

I've had good luck with Nutch using tons of memory - I went well past 3
GB.  To be fair, I don't know how much was Nutch spidering vs. Lucene
indexing.
http://www.enterprisesearchblog.com/2009/01/virtualization-and-search-performance-tests-summary.html

The only problem I can think of is that, to use that much memory, all
three components need to be 64-bit: a 64-bit chip, a 64-bit OS, and a 64-bit JVM.

I imagine you know this already, just triple checking.  I was using a stock
Sun JVM, the 64-bit JVM for Windows.  I suppose that could be a difference -
maybe they don't provide a 64-bit JVM for Linux?  Or maybe you were using
somebody else's JVM?
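
One quick sanity check is to print what the searcher JVM is actually running
as (a small standalone sketch; sun.arch.data.model is a Sun/HotSpot-specific
property and may be absent on other vendors' JVMs):

public class JvmCheck {
  public static void main(String[] args) {
    // "32" or "64" on Sun/HotSpot JVMs.
    System.out.println("data model : " + System.getProperty("sun.arch.data.model"));
    System.out.println("os.arch    : " + System.getProperty("os.arch"));
    // Effective -Xmx as seen by the running JVM, in MB.
    System.out.println("max heap MB: " + Runtime.getRuntime().maxMemory() / (1024L * 1024L));
  }
}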

Mark

PS: Reminder, if you reply to me, pls CC me.

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513



Re: Search performance for large indexes (>100M docs)

Posted by buddha1021 <bu...@yahoo.cn>.
Hi Sean Dean-3:
“I have one index just above 20 million that takes up about 29GB in space.”
That's really great! The difficulty for me is the size of the indexes - it's
too large! If a 20-million-page index is only ~30GB, the difficulty can be
solved!
If your idea becomes a reality, searching 100 million will become a reality
too! This is also my goal.
Come on!
Looking forward to your test results!


Re: Search performance for large indexes (>100M docs)

Posted by Sean Dean <se...@rogers.com>.
I'm going to have to disagree and also explain my reasoning.

A 32GB SLC drive (or even MLC, if we are talking about size) is capable of holding a 20-million-page index. I have one index just above 20 million documents that takes up about 29GB of space.

Hadoop is designed to run on commodity PCs, but when I talk about a "1U server" it's more that I'm referring to a 1U server chassis with commodity-type parts inside. The reason for this is co-location; I can't place multiple desktops or towers at the datacenter.

You would only need 5 machines to search 100 million pages, although this isn't taking speed into consideration.




Re: Search performance for large indexes (>100M docs)

Posted by buddha1021 <bu...@yahoo.cn>.
How many pages does the 32GB SLC SSD index contain? 20 million? I don't think
so.

Also, Hadoop runs on common PCs; in my opinion, to speed up search we should
increase the number of common PCs. A common PC can also have a dual-core CPU,
8GB RAM, and a 260GB HD, and the SLC SSD is expensive, so why do we still want
to use a 1U server and not a common PC? Hadoop was originally designed for
common PCs, not servers. If we run Hadoop on servers instead of common PCs,
and at the same time have to put all the indexes into RAM to speed things up,
then Hadoop is basically of no use!

I think we can use common PCs (dual-core, 8GB RAM, 260GB HD), not 1U servers,
to achieve the very fast speed we require. The biggest difficulty is that we
don't know the specific number of nodes needed to search 100 million pages.

Does anyone know what the main reason for Hadoop's compute speed is? For
Hadoop, I only know the principles and the APIs of MapReduce and HDFS; I have
not read Hadoop's code.
In my opinion, anyone who really understands Hadoop knows how to increase the
search speed!



Sean Dean-3 wrote:
> 
> You might also want to research using single-level cell (SLC) SSDs instead of
> bulking up on RAM. Google has started using them in new search servers to
> save power but also speed up search I/O.
> 
> I wouldn't suggest performing Hadoop operations on them (e.g. crawling or
> indexing) since mechanical hard drives will still be more efficient but
> they are well designed for searches, with near zero latency and ultra fast
> sequential read speed.
> 
> In the next few months I plan to experiment with a few search servers
> hosting indexes off these types of drives. I would like to see if a 20
> million index shard (25-30GB in size - the theoretical Nutch limit per
> server) would have comparable overall speed vs. something smaller in RAM.
> If my suspicions are correct and it is, then this would represent
> considerable cost saving to anyone looking to go down the >100 million
> document road.
> 
> It's my dream to see (5) dual-core, 8GB RAM, 500GB HD and 32GB SLC SSD 1U
> servers each running at 0.75 amps hosting 20 million pages, combining for
> 100 million overall at about $1000.00 per server. Then you could put 18 in
> a half-rack with a standard 15A line and serve close to 360 million
> results. I have a dream, and my god I'm going to try it.