Posted to java-user@lucene.apache.org by raistlink <el...@gmail.com> on 2009/05/18 11:42:02 UTC

Max size of index? How do search engines avoid this?

Hi,
I think I've read that there is a limit on index size, maybe 2 GB on FAT
machines. If this is right, could you point me to good resources (websites
or books) about programming search engines, so I can learn the techniques
big search engines use to search such huge amounts of data?

Thanks
-- 
View this message in context: http://www.nabble.com/Max-size-of-index--How-do-search-engines-avoid-this--tp23594241p23594241.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Max size of index? How do search engines avoid this?

Posted by mark harwood <ma...@yahoo.co.uk>.
>techniques used by big search engines to search among such huge data.

Two keywords here: partitioning and replication.

Partitioning is breaking the content down into shards and assigning shards to servers. These can then be queried in parallel to make search response times independent of the data volumes being searched. I seem to remember a quote that a single Google search currently gets spread across ~1,000 servers in parallel.
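
For the partitioning side, here is a minimal sketch using Lucene's
ParallelMultiSearcher, which fans one query out to several shard indexes
on their own threads and merges the hits into a single ranked list. This
assumes a Lucene 3.x-style API, and the shard paths are made up:

import java.io.File;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class ShardedSearch {
    public static void main(String[] args) throws Exception {
        // Each shard is an ordinary Lucene index; open one searcher per
        // shard. The paths below are hypothetical.
        Searchable[] shards = {
            new IndexSearcher(FSDirectory.open(new File("/indexes/shard0"))),
            new IndexSearcher(FSDirectory.open(new File("/indexes/shard1"))),
        };
        // ParallelMultiSearcher runs the query against every shard on a
        // separate thread and merges the hits into one result list.
        ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);
        TopDocs top = searcher.search(
            new TermQuery(new Term("body", "lucene")), 10);
        System.out.println("total hits: " + top.totalHits);
        searcher.close();
    }
}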

Replication is about handling user volumes - take each shard, assign it to many replica servers, then load-balance requests across them to spread the load. This also gives you redundancy and helps in recovery from machine failure.


You may want to take a look at Solr to help you with this.

Cheers
Mark





Re: Max size of index? How do search engines avoid this?

Posted by Danil ŢORIN <to...@gmail.com>.
The 2 GB size is a limitation of the OS and/or file system, not of the
index as supported by Lucene.
Lucene does have a limit of its own: an index can hold fewer than
2,147,483,648 (2^31) documents.
However, a Lucene index may reach tens or hundreds of GB on disk
long before that.

If you are thinking about BIG indexes, you should forget Windows + FAT32.

On Linux I've seen big indexes, e.g. 80M relatively small documents,
about 50 GB on disk, with reasonable performance (on a pretty cheap
machine).

If you need more documents, better performance, etc., you need to
partition your index into several smaller indexes running on separate
hosts, query them in parallel, and then merge the results into a single
result set.

This mode of operation is not built into Lucene, but you can relatively
easily build a customized wrapper to do it; a rough sketch follows.
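
To illustrate, a rough sketch of such a wrapper, assuming plain local
IndexSearchers and a Lucene 3.x-style API (in practice each shard would
sit behind RMI or HTTP). One caveat: merging by raw score across shards
is only approximate unless term statistics are shared.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class FanOutSearcher {
    private final List<IndexSearcher> shards;
    private final ExecutorService pool;

    public FanOutSearcher(List<IndexSearcher> shards) {
        this.shards = shards;
        this.pool = Executors.newFixedThreadPool(shards.size());
    }

    public ScoreDoc[] search(final Query query, final int n) throws Exception {
        // Ask every shard for its own top n hits, in parallel.
        List<Future<TopDocs>> futures = new ArrayList<Future<TopDocs>>();
        for (final IndexSearcher shard : shards) {
            futures.add(pool.submit(new Callable<TopDocs>() {
                public TopDocs call() throws Exception {
                    return shard.search(query, n);
                }
            }));
        }
        // Merge: pool all hits, sort by score (descending), keep the
        // global top n. Note the doc ids remain shard-local, so a real
        // wrapper must also remember which shard each hit came from.
        List<ScoreDoc> all = new ArrayList<ScoreDoc>();
        for (Future<TopDocs> f : futures) {
            all.addAll(Arrays.asList(f.get().scoreDocs));
        }
        Collections.sort(all, new Comparator<ScoreDoc>() {
            public int compare(ScoreDoc a, ScoreDoc b) {
                return Float.compare(b.score, a.score);
            }
        });
        return all.subList(0, Math.min(n, all.size())).toArray(new ScoreDoc[0]);
    }
}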

AFAIK something similar powers Google: each box handles about 10M docs,
and there are thousands of boxes doing searches in parallel.

