Posted to general@lucene.apache.org by James <ja...@ryley.com> on 2006/02/04 03:07:35 UTC

Best Lucene hardware

Hi,

 

I'm wondering if someone familiar with the way Lucene accesses data could
give their opinion on whether hard drive seek time or throughput is more
important in Lucene performance, assuming a very large index that cannot fit
in RAM.  I'm looking at buying some new servers that will be running Lucene,
and wonder if I should go with SCSI RAID, or if perhaps spending the extra
money on processors (and going with SATA for drives) is better.  I'm not
sure where the bottleneck is in an average system, and I don't have any SCSI
RAID systems available for testing.

 

Thanks,

James
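
One rough way to find out where the bottleneck is before buying hardware is to compare CPU time with wall-clock time for a representative batch of queries. The sketch below is Python rather than Lucene code, and the names `cpu_fraction`, `cpu_heavy`, and `wait_heavy` are hypothetical; the two workloads are stand-ins, not real searches. A ratio near 1 suggests the work is CPU-limited, and a ratio near 0 suggests it mostly waits (for example on disk).

```python
import time

def cpu_fraction(work, repeats=5):
    """Return CPU time divided by wall-clock time for `repeats` calls of work()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for _ in range(repeats):
        work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0

def cpu_heavy():
    # Stand-in for scoring work: pure computation, no waiting.
    sum(i * i for i in range(200_000))

def wait_heavy():
    # Stand-in for blocking on disk: the process sleeps instead of computing.
    time.sleep(0.05)

cpu_ratio = cpu_fraction(cpu_heavy)
wait_ratio = cpu_fraction(wait_heavy)
print(f"cpu_heavy:  {cpu_ratio:.2f}")
print(f"wait_heavy: {wait_ratio:.2f}")
```

To use it on a real system, `work` would be replaced by a function that runs a sample of real queries against the real index, ideally with a cold file cache so that disk waits show up. On a multi-threaded searcher the ratio can exceed 1, since CPU time is summed over threads.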


Re: Term Vectors -- searching or just ranking?

Posted by Fredrik Andersson <fi...@gmail.com>.
Hi James,

I can't speak for anyone else, but in my experience the general
approach is to first select a subset based on the angle between the query
vector and the document vector in their non-reduced forms (this is a normal
keyword search, what Lucene does by default, expressed in vector notation).
From there, you take the documents in that subset along with their reduced
term vectors and compare their angles to the reduced query vector.
If you skip the first step, you need one dot product (query vector and
document vector) for every document in your database, but you only need
to store the reduced term vectors. That's a lot of computation, but it's
necessary if you want to match documents that are related to a query but
do not contain some or all of the words in it. In my experience, this
is a cool feature, but the hits returned are
usually pretty shitty. If you don't get a hit on a normal keyword search,
just leave the document out (note, this is only my opinion).
Some terminology in case you did not follow: "reduced" refers to the
projection of a vector onto a smaller subspace (you can normally reduce the
dimension / column space of the term-document matrix by ~60% with virtually
no loss of precision in your searches). See "singular value decomposition"
for details.

Hope that helps,
Fredrik
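
A minimal sketch of the two approaches described above, in plain Python rather than Lucene code. Everything here is an assumption made for illustration: the four-term vocabulary, the toy documents, the function names, and above all the hand-picked `basis`, which stands in for the reduced subspace that a singular value decomposition would normally supply.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def project(vec, basis):
    """Project a full term vector onto the reduced basis (one row per reduced dimension)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in basis]

def rank(query_red, docs, basis, doc_ids, top_n):
    """Rank the given documents by cosine similarity in the reduced space."""
    scored = [(cosine(query_red, project(docs[i], basis)), i) for i in doc_ids]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by doc id
    return [i for _, i in scored[:top_n]]

def two_stage_search(query, docs, basis, top_n=2):
    # Stage 1: keyword match, i.e. a nonzero cosine in the full term space.
    candidates = [i for i, d in enumerate(docs) if cosine(query, d) > 0.0]
    # Stage 2: re-rank only those candidates in the reduced space.
    return rank(project(query, basis), docs, basis, candidates, top_n)

def reduced_only_search(query, docs, basis, top_n=2):
    # No stage 1: one comparison per document in the whole collection.
    return rank(project(query, basis), docs, basis, range(len(docs)), top_n)

# Hypothetical vocabulary: [car, auto, engine, flower]
docs = [
    [1, 0, 1, 0],  # doc 0: car engine
    [0, 1, 1, 0],  # doc 1: auto engine
    [0, 0, 0, 1],  # doc 2: flower
    [1, 0, 0, 1],  # doc 3: car flower
]
# Hand-picked 2-dimensional basis ("vehicle", "plant"); not computed by an SVD.
basis = [
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
query = [1, 0, 0, 0]  # "car"

print(two_stage_search(query, docs, basis))     # only docs containing "car"
print(reduced_only_search(query, docs, basis))  # can include doc 1 ("auto engine")
```

The second search can return doc 1 even though it never mentions "car", which is the "related but does not contain the words" behaviour, at the price of touching every document. A real system would store the projected document vectors instead of recomputing them per query.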




On 4/20/06, James <ja...@ryley.com> wrote:
>
> Hi,
>
> We are implementing term vectors, and there is something about which I am
> unclear:  Can term vectors be used to perform a search in its entirety
> (e.g., rank all 1 million documents in a database in order, and then return
> the top 100), or, due to computational time requirements, are term vectors
> only intended to be a ranking method for a small subset of data that is the
> result of a Boolean search (e.g., we know the 100 documents that are possible
> answers, now put them in relevancy order)?
>
> Thanks,
> James
>
>

RE: Best Lucene hardware

Posted by James <ja...@ryley.com>.
Thanks for the feedback.  I saw those solid-state hard drives, and those are
definitely an interesting option if I am I/O limited.  But I suspect that I
am CPU limited, which (ironically, after all the investigation that I have
done) seems to make commodity server farms the best option.

Thanks,
James

> Dear James,
> 
> I recently had the same question, but no definitive answer to offer.
> 
> I guess that throughput/access time requirements depend on:
> a) document size (the larger the documents, the more throughput might
> matter)
> b) how many documents you actually want to read (only a few to display
> them, or all of them to do some processing); if you want to read many
> documents, seek time becomes more important
> My best guess is that access time is more important for you, unless you
> store only very few very large documents.
> 
> Of course you should look for discs with native command queuing (the disc
> may reorder the read commands to reduce seek time).
> 
> Another option (if your memory requirements are not so huge): a solid state
> disk, see e.g.
> http://techreport.com/reviews/2006q1/gigabyte-iram/index.x?pg=7
> 
> The second version is supposed to support up to 16 GB, see
> http://www.vr-zone.com.sg/?i=3052
> 
> Best regards,
> 
> Wolfgang



Term Vectors -- searching or just ranking?

Posted by James <ja...@ryley.com>.
Hi,

We are implementing term vectors, and there is something about which I am
unclear:  Can term vectors be used to perform a search in its entirety
(e.g., rank all 1 million documents in a database in order, and then return the
top 100), or, due to computational time requirements, are term vectors only
intended to be a ranking method for a small subset of data that is the
result of a Boolean search (e.g., we know the 100 documents that are possible
answers, now put them in relevancy order)?

Thanks,
James


RE: Best Lucene hardware

Posted by WolfgangTäger <wt...@epo.org>.
Dear James,

I recently had the same question, but no definitive answer to offer.

I guess that throughput/access time requirements depend on:
a) document size (the larger the documents, the more throughput might
matter)
b) how many documents you actually want to read (only a few to display
them, or all of them to do some processing); if you want to read many
documents, seek time becomes more important

My best guess is that access time is more important for you, unless you 
store only very few very large documents.

Of course you should look for discs with native command queuing (the disc
may reorder the read commands to reduce seek time).

Another option (if your memory requirements are not so huge): a solid state
disk, see e.g.
http://techreport.com/reviews/2006q1/gigabyte-iram/index.x?pg=7

The second version is supposed to support up to 16 GB, see
http://www.vr-zone.com.sg/?i=3052

Best regards,

Wolfgang
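
The seek-versus-throughput trade-off above can be put into rough numbers with a simple model: the time to read one document is one seek plus the transfer time. The sketch below is only an illustration. The function names are hypothetical, and the 8 ms seek time and 50 MB/s throughput are assumed figures, not measurements of any particular drive.

```python
def read_time_ms(doc_bytes, seek_ms, throughput_mb_s):
    """Estimated time to read one document: one seek plus the transfer time."""
    transfer_ms = doc_bytes / (throughput_mb_s * 1_000_000) * 1000  # 1 MB = 10^6 bytes
    return seek_ms + transfer_ms

def seek_share(doc_bytes, seek_ms, throughput_mb_s):
    """Fraction of the total read time that is spent seeking."""
    return seek_ms / read_time_ms(doc_bytes, seek_ms, throughput_mb_s)

def break_even_bytes(seek_ms, throughput_mb_s):
    """Document size at which seek time and transfer time are equal."""
    return seek_ms / 1000 * throughput_mb_s * 1_000_000

# Assumed drive characteristics, for illustration only.
SEEK_MS = 8.0
THROUGHPUT_MB_S = 50.0

for size in (10_000, 1_000_000, 100_000_000):
    share = seek_share(size, SEEK_MS, THROUGHPUT_MB_S)
    print(f"{size:>11} bytes: {share:.1%} of read time is seeking")

print(f"break-even size: {break_even_bytes(SEEK_MS, THROUGHPUT_MB_S):.0f} bytes")
```

Under these assumptions, seeking dominates for documents well below the break-even size, and throughput only starts to matter for documents well above it, which matches the guess that access time is the more important figure unless there are only a few very large documents.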
 
 

 
 
 



"James" <ja...@ryley.com> 
wrote on 05-02-2006 18:12:

Hi,

Thanks for the info.  Unfortunately, most of that has to do with indexing,
whereas I am concerned with retrieval speed.  And, there really isn't enough
information there to make good comparisons -- there are several completely
different systems with no way to pin down what the important changes in
hardware are.  But, thanks for the link!

Sincerely,
James

> -----Original Message-----
> From: sivan v [mailto:sivanarul_v@yahoo.com]
> Sent: Sunday, February 05, 2006 9:47 AM
> To: general@lucene.apache.org
> Subject: Re: Best Lucene hardware
> 
> Hello Mr. James,
> 
>   You can get some info from the following link:
> 
>   http://lucene.apache.org/java/docs/benchmarks.html




RE: Best Lucene hardware

Posted by James <ja...@ryley.com>.
Hi,

Thanks for the info.  Unfortunately, most of that has to do with indexing,
whereas I am concerned with retrieval speed.  And, there really isn't enough
information there to make good comparisons -- there are several completely
different systems with no way to pin down what the important changes in
hardware are.  But, thanks for the link!

Sincerely,
James

> -----Original Message-----
> From: sivan v [mailto:sivanarul_v@yahoo.com]
> Sent: Sunday, February 05, 2006 9:47 AM
> To: general@lucene.apache.org
> Subject: Re: Best Lucene hardware
> 
> Hello Mr. James,
> 
>   You can get some info from the following link:
> 
>   http://lucene.apache.org/java/docs/benchmarks.html



Re: Best Lucene hardware

Posted by sivan v <si...@yahoo.com>.
Hello Mr. James,

You can get some info from the following link:

  http://lucene.apache.org/java/docs/benchmarks.html


James <ja...@ryley.com> wrote:
  Hi,



I'm wondering if someone familiar with the way Lucene accesses data could
give their opinion on whether hard drive seek time or throughput is more
important in Lucene performance, assuming a very large index that cannot fit
in RAM. I'm looking at buying some new servers that will be running Lucene,
and wonder if I should go with SCSI RAID, or if perhaps spending the extra
money on processors (and going with SATA for drives) is better. I'm not
sure where the bottleneck is in an average system, and I don't have any SCSI
RAID systems available for testing.



Thanks,

James




      Enduringly yours,
  V.Sivanarul.,M.Tech.




		