Posted to java-user@lucene.apache.org by Sebastin <se...@gmail.com> on 2008/04/30 05:54:07 UTC

Does Lucene Supports Billions of data

Hi All,
Does Lucene support billions of records in a single index store of size 14 GB,
searched on every query? I have 3 index stores of 14 GB each; I need to search
these index stores and retrieve the results, but it throws an out-of-memory
error while searching them.
-- 
View this message in context: http://www.nabble.com/Does-Lucene-Supports-Billions-of-data-tp16974808p16974808.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Does Lucene Supports Billions of data

Posted by John Wang <jo...@gmail.com>.
I am not sure why this is the case; the docid is internal to the sub-index. As
long as each sub-index stays below 2 billion documents, there is no need for
the docid to be a long. With multiple indexes, I was thinking of having an
aggregator which merges maybe only a page of search results.

Example:

sub index 1: 1 billion docs
sub index 2: 1 billion docs
sub index 3: 1 billion docs

federating search to these subindexes, you represent an index of 3 billion
docs, and all internal doc ids are of type int.
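A minimal sketch of that aggregator idea (plain illustrative Java, not Lucene's API; the `Federator` class and its method names are hypothetical): each sub-index keeps its int docids, and the federation layer packs (subIndex, localDoc) into one long to form a global identifier while merging a single page of results by score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of federated result merging: each sub-index keeps
// int docids internally; the aggregator packs (subIndex, localDoc) into a long.
public class Federator {
    // Pack a sub-index ordinal and its local int docid into one global long id.
    static long globalId(int subIndex, int localDoc) {
        return ((long) subIndex << 32) | (localDoc & 0xFFFFFFFFL);
    }

    static int subIndexOf(long globalId) { return (int) (globalId >>> 32); }
    static int localDocOf(long globalId) { return (int) globalId; }

    // A scored hit from one sub-index.
    record Hit(long globalId, float score) {}

    // Merge one page of results from each sub-index by descending score.
    static List<Hit> mergePage(List<List<Hit>> pages, int pageSize) {
        List<Hit> all = new ArrayList<>();
        pages.forEach(all::addAll);
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return all.subList(0, Math.min(pageSize, all.size()));
    }
}
```

Each sub-index stays within the int docid range; only the federation layer ever sees the wider long identifier.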

Maybe I am not understanding something.

-John

On Wed, Apr 30, 2008 at 4:10 PM, Daniel Noll <da...@nuix.com> wrote:

> On Thursday 01 May 2008 00:01:48 John Wang wrote:
> > I am not sure how well lucene would perform with > 2 Billion docs in a
> > single index anyway.
>
> Even if they're in multiple indexes, the doc IDs being ints will still
> prevent
> it going past 2Gi unless you wrap your own framework around it.
>
> Daniel
>

Re: Does Lucene Supports Billions of data

Posted by Yonik Seeley <yo...@apache.org>.
On Wed, Apr 30, 2008 at 7:10 PM, Daniel Noll <da...@nuix.com> wrote:
> On Thursday 01 May 2008 00:01:48 John Wang wrote:
>  > I am not sure how well lucene would perform with > 2 Billion docs in a
>  > single index anyway.
>
>  Even if they're in multiple indexes, the doc IDs being ints will still prevent
>  it going past 2Gi unless you wrap your own framework around it.

Right.
Solr's distributed search does use "long" where appropriate, and
should be able to scale past 2B docs.

-Yonik



RE: Does Lucene Supports Billions of data

Posted by sp...@gmx.eu.
> Even if they're in multiple indexes, the doc IDs being ints 
> will still prevent 
> it going past 2Gi unless you wrap your own framework around it.

Hm. Does this mean that a MultiReader has the int-limit too?
I thought that this limit applies to a single index only...
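It does apply to a MultiReader too: a composite reader presents its sub-readers as one flat docid space by adding per-reader offsets, so the combined document count must itself fit in an int. A self-contained sketch of that offset arithmetic (illustrative only, not the Lucene source; the class name is hypothetical):

```java
// Sketch of how a composite reader maps (readerIndex, localDoc) into one
// flat int docid space via cumulative offsets -- the reason the ~2.1 billion
// limit applies to the combined reader, not just a single index.
public class CompositeDocIds {
    final int[] starts;   // starts[i] = docid offset of sub-reader i
    final int maxDoc;     // total docs across all sub-readers

    CompositeDocIds(int[] subMaxDocs) {
        starts = new int[subMaxDocs.length];
        long total = 0;
        for (int i = 0; i < subMaxDocs.length; i++) {
            starts[i] = (int) total;
            total += subMaxDocs[i];
            if (total > Integer.MAX_VALUE)
                throw new IllegalArgumentException("too many documents: " + total);
        }
        maxDoc = (int) total;
    }

    // Global docid = offset of the sub-reader + the local docid within it.
    int globalDoc(int readerIndex, int localDoc) {
        return starts[readerIndex] + localDoc;
    }
}
```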




Re: Does Lucene Supports Billions of data

Posted by Daniel Noll <da...@nuix.com>.
On Thursday 01 May 2008 00:01:48 John Wang wrote:
> I am not sure how well lucene would perform with > 2 Billion docs in a
> single index anyway.

Even if they're in multiple indexes, the doc IDs being ints will still prevent 
it going past 2Gi unless you wrap your own framework around it.

Daniel



Re: Does Lucene Supports Billions of data

Posted by Glen Newton <gl...@gmail.com>.
I understand. But it depends on the implementation: if there are things in
Lucene that are O(n^2) or worse, then Moore's Law will not help with
large numbers. But if they are mostly O(n) or O(n log n) at large scale,
then we can wait for bigger, faster machines with more cores to let us
use Lucene for billions of documents. You can go out now and buy a
64 dual-core Sun SPARC box, which would likely scale better than any
network solution.
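This complexity argument can be put in numbers with a quick back-of-the-envelope calculation (plain Java, purely illustrative): going from 25 million to 3 billion documents is a 120x increase in n, which costs roughly 154x more work under O(n log n) but 14,400x more under O(n^2).

```java
import java.util.function.DoubleUnaryOperator;

// Back-of-the-envelope: how much more work is needed when the corpus grows
// from 25 million to 3 billion documents, under different complexity classes.
public class ScalingCost {
    // Ratio of cost(n2) to cost(n1) for a given cost function.
    static double growthFactor(double n1, double n2, DoubleUnaryOperator cost) {
        return cost.applyAsDouble(n2) / cost.applyAsDouble(n1);
    }

    public static void main(String[] args) {
        double small = 25e6, big = 3e9;   // 120x more documents
        System.out.printf("O(n):       %.0fx%n", growthFactor(small, big, n -> n));
        System.out.printf("O(n log n): %.0fx%n", growthFactor(small, big, n -> n * Math.log(n)));
        System.out.printf("O(n^2):     %.0fx%n", growthFactor(small, big, n -> n * n));
    }
}
```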

But of course, as you point out, if we are going with commodity hardware, the
Google-style distributed back-end solution is the way to go...

-glen

2008/4/30 John Wang <jo...@gmail.com>:
> I am not sure how well lucene would perform with > 2 Billion docs in a
>  single index anyway.
>  I have posted a while ago about considering different ways of building
>  distributed search. A master-slave hierarchical model has been the norm, I
>  was hoping to see more of a system built on top of a Hadoop like
>  infrastructure where it is seamless to scale. Ning at IBM has written some
>  cool stuff into HBase for building index shards from an HBase table.
>
>  -John
>
>
>
>  On Wed, Apr 30, 2008 at 9:46 PM, Glen Newton <gl...@gmail.com> wrote:
>
>  > I have created Indexes with 1.5 billion documents.
>  >
>  > It was experimental: I took an index with 25 million documents, and
>  > merged it with itself many times. While not definitive as there were
>  > only 25m unique documents that were duplicated, it did prove that
>  > Lucene should be able to handle this number of (unique) documents.
>  >
>  > That said, Lucene needs to support >2B, so docids (and all associated
>  > internals) need to become 'long' fairly soon....
>  >
>  > -Glen
>  >
>  > 2008/4/30 John Wang <jo...@gmail.com>:
>  > > lucene docids are represented in a java int, so max signed int would be
>  > the
>  > >  limit, a little over 2 billion.
>  > >
>  > >  -John
>  > >
>  > >
>  > >
>  > >  On Wed, Apr 30, 2008 at 11:54 AM, Sebastin <se...@gmail.com>
>  > wrote:
>  > >
>  > >  >
>  > >  > Hi All,
>  > >  > Does Lucene supports Billions of data in a single index store of size
>  > 14
>  > >  > GB
>  > >  > for every search.I have 3 Index Store of size 14 GB per index i need
>  > to
>  > >  > search these index store and retreive the result.it throws out of
>  > memory
>  > >  > problem while searching this index stores.
>  > >  > --
>  > >  > View this message in context:
>  > >  >
>  > http://www.nabble.com/Does-Lucene-Supports-Billions-of-data-tp16974808p16974808.html
>  > >  > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>  > >  >
>  > >  >
>  > >
>  >
>  >
>  >
>





Re: Does Lucene Supports Billions of data

Posted by John Wang <jo...@gmail.com>.
I am not sure how well Lucene would perform with > 2 billion docs in a
single index anyway.
I posted a while ago about considering different ways of building
distributed search. A master-slave hierarchical model has been the norm; I
was hoping to see more of a system built on top of a Hadoop-like
infrastructure where it is seamless to scale. Ning at IBM has written some
cool stuff into HBase for building index shards from an HBase table.

-John

On Wed, Apr 30, 2008 at 9:46 PM, Glen Newton <gl...@gmail.com> wrote:

> I have created Indexes with 1.5 billion documents.
>
> It was experimental: I took an index with 25 million documents, and
> merged it with itself many times. While not definitive as there were
> only 25m unique documents that were duplicated, it did prove that
> Lucene should be able to handle this number of (unique) documents.
>
> That said, Lucene needs to support >2B, so docids (and all associated
> internals) need to become 'long' fairly soon....
>
> -Glen
>
> 2008/4/30 John Wang <jo...@gmail.com>:
> > lucene docids are represented in a java int, so max signed int would be
> the
> >  limit, a little over 2 billion.
> >
> >  -John
> >
> >
> >
> >  On Wed, Apr 30, 2008 at 11:54 AM, Sebastin <se...@gmail.com>
> wrote:
> >
> >  >
> >  > Hi All,
> >  > Does Lucene supports Billions of data in a single index store of size
> 14
> >  > GB
> >  > for every search.I have 3 Index Store of size 14 GB per index i need
> to
> >  > search these index store and retreive the result.it throws out of
> memory
> >  > problem while searching this index stores.
> >  > --
> >  > View this message in context:
> >  >
> http://www.nabble.com/Does-Lucene-Supports-Billions-of-data-tp16974808p16974808.html
> >  > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >  >
> >  >
> >
>
>
>

Re: Does Lucene Supports Billions of data

Posted by Glen Newton <gl...@gmail.com>.
I have created indexes with 1.5 billion documents.

It was experimental: I took an index with 25 million documents and
merged it with itself many times. While not definitive, since there were
only 25M unique documents that were duplicated, it did show that
Lucene should be able to handle this number of (unique) documents.

That said, Lucene needs to support >2B docs, so docids (and all associated
internals) need to become 'long' fairly soon...

-Glen

2008/4/30 John Wang <jo...@gmail.com>:
> lucene docids are represented in a java int, so max signed int would be the
>  limit, a little over 2 billion.
>
>  -John
>
>
>
>  On Wed, Apr 30, 2008 at 11:54 AM, Sebastin <se...@gmail.com> wrote:
>
>  >
>  > Hi All,
>  > Does Lucene supports Billions of data in a single index store of size 14
>  > GB
>  > for every search.I have 3 Index Store of size 14 GB per index i need to
>  > search these index store and retreive the result.it throws out of memory
>  > problem while searching this index stores.
>  > --
>  > View this message in context:
>  > http://www.nabble.com/Does-Lucene-Supports-Billions-of-data-tp16974808p16974808.html
>  > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>  >
>  >
>





Re: Does Lucene Supports Billions of data

Posted by John Wang <jo...@gmail.com>.
Lucene docids are represented as a Java int, so the max signed int would be
the limit: a little over 2 billion.
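That ceiling is Integer.MAX_VALUE. A plain-Java check (no Lucene required) of what happens when three billion-document indexes are counted in int arithmetic:

```java
// The concrete ceiling: a single int docid space holds at most
// Integer.MAX_VALUE documents (2,147,483,647 -- "a little over 2 billion").
public class DocIdLimit {
    static final int MAX_DOCS = Integer.MAX_VALUE;

    // Summing three billion-doc indexes in int arithmetic silently overflows...
    static int naiveTotal(int a, int b, int c) { return a + b + c; }

    // ...so document totals past the limit must be carried in a long.
    static long safeTotal(int a, int b, int c) { return (long) a + b + c; }
}
```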

-John

On Wed, Apr 30, 2008 at 11:54 AM, Sebastin <se...@gmail.com> wrote:

>
> Hi All,
> Does Lucene supports Billions of data in a single index store of size 14
> GB
> for every search.I have 3 Index Store of size 14 GB per index i need to
> search these index store and retreive the result.it throws out of memory
> problem while searching this index stores.
> --
> View this message in context:
> http://www.nabble.com/Does-Lucene-Supports-Billions-of-data-tp16974808p16974808.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>