Posted to java-user@lucene.apache.org by Ray <ra...@rayweb.de> on 2008/03/06 16:23:11 UTC

MultiSearcher to overcome the Integer.MAX_VALUE limit

Hey Guys,

just a quick question to confirm an assumption I have.

Is it correct that I can have around 100 indexes, each at its
Integer.MAX_VALUE limit of documents, and still happily search
them all with a MultiSearcher, as long as the combined returned
hits don't add up to Integer.MAX_VALUE themselves?
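
Concretely, I have something like the following in mind (just a rough
sketch against the Lucene 2.x API; the index paths, field name, and query
below are placeholders, not my real setup):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TopDocs;

public class ShardedSearch {
    public static void main(String[] args) throws Exception {
        // One IndexSearcher per index; each individual index stays safely
        // below the Integer.MAX_VALUE document limit on its own.
        String[] indexPaths = { "/indexes/shard00", "/indexes/shard01" };
        Searchable[] shards = new Searchable[indexPaths.length];
        for (int i = 0; i < indexPaths.length; i++) {
            shards[i] = new IndexSearcher(indexPaths[i]);
        }

        // The MultiSearcher presents all shards as one logical index.
        MultiSearcher searcher = new MultiSearcher(shards);

        Query query = new QueryParser("body", new StandardAnalyzer())
                .parse("some query terms");
        TopDocs top = searcher.search(query, (Filter) null, 10); // best 10 overall
        // Note: totalHits is itself an int -- the limit this question is about.
        System.out.println("total hits: " + top.totalHits);

        searcher.close();
    }
}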

Kind regards,

Ray.



Re: MultiSearcher to overcome the Integer.MAX_VALUE limit

Posted by Erick Erickson <er...@gmail.com>.
Well, I really don't have a clue what'll happen with that many
documents. It's more a matter of unique terms from what I
understand.

I'll be *really* curious how it turns out.

Erick

On Thu, Mar 6, 2008 at 6:03 PM, Ray <ra...@rayweb.de> wrote:

>
> Thanks for your answer.
>
> Well, I want to search around 6 billion documents.
> Most of them are very small, but I am confident I will hit
> that number in the long run.
>
> I am currently running a small random text indexer with 400 docs/second.
> It will reach 2 billion in around 45 days.
>
> I really hope that those of you saying 2 billion docs
> will bring Lucene to its knees are wrong...
>
> Ray.
>
> ----- Original Message -----
> From: "Erick Erickson" <er...@gmail.com>
> To: <ja...@lucene.apache.org>
> Sent: Thursday, March 06, 2008 10:40 PM
> Subject: Re: MultiSearcher to overcome the Integer.MAX_VALUE limit
>
>
> > Well, I'm not sure. But any index, even one split amongst many nodes
> > is going to have some interesting performance characteristics if you
> > have over 2 billion documents.... So I'm not sure it matters <G>...
> >
> > What problem are you really trying to solve? You'll probably get
> > more meaningful answers if you tell us what that is.
> >
> > Best
> > Erick
> >
> > On Thu, Mar 6, 2008 at 10:23 AM, Ray <ra...@rayweb.de> wrote:
> >
> >> Hey Guys,
> >>
> >> just a quick question to confirm an assumption I have.
> >>
> >> Is it correct that I can have around 100 indexes, each at its
> >> Integer.MAX_VALUE limit of documents, and still happily search
> >> them all with a MultiSearcher, as long as the combined returned
> >> hits don't add up to Integer.MAX_VALUE themselves?
> >>
> >> Kind regards,
> >>
> >> Ray.
> >>
> >>
> >>
> >

RE: MultiSearcher to overcome the Integer.MAX_VALUE limit

Posted by sp...@gmx.eu.
> Right...  but trust me, you really wouldn't want to.  You need
> distributed search at that level anyway.

Hm, 2 billion small docs is not that much.
Why do I need distributed search, and what exactly do you mean by
distributed search? Multiple IndexSearchers? Multiple processes? Multiple
machines?




Re: MultiSearcher to overcome the Integer.MAX_VALUE limit

Posted by Yonik Seeley <yo...@apache.org>.
On Sat, Mar 8, 2008 at 2:06 PM,  <sp...@gmx.eu> wrote:
> Does this mean that I cannot search indexes with more than 2 billion docs at
>  all with a single IndexSearcher?

Right...  but trust me, you really wouldn't want to.  You need
distributed search at that level anyway.

-Yonik



RE: MultiSearcher to overcome the Integer.MAX_VALUE limit

Posted by sp...@gmx.eu.
Does this mean that I cannot search indexes with more than 2 billion docs at
all with a single IndexSearcher? 

> -----Original Message-----
> From: Mark Miller [mailto:markrmiller@gmail.com] 
> Sent: Saturday, March 08, 2008 18:57
> To: java-user@lucene.apache.org
> Subject: Re: MultiSearcher to overcome the Integer.MAX_VALUE limit
> 
> Generating random text word by word can often be pretty slow.
> 
> I think you will have to modify the MultiSearcher a bit. The 
> MultiSearcher takes a global id space and converts to and from an 
> individual Searcher id space. The MultiSearcher's id space is 
> limited to 
> an int as well, but I think if you change it to a float/double, you 
> should be all set.
> 
> - Mark
> 
> Toke Eskildsen wrote:
> > On Fri, 2008-03-07 at 00:03 +0100, Ray wrote:
> >    
> >> I am currently running a small random text indexer with 400 docs/second.
> >> It will reach 2 billion in around 45 days.
> >
> > If you are just doing it to test large indexes (in terms of document
> > count), then you need to look into your index-generation code. I tried
> > making an ultra-simple index builder, where each document contains a
> > unique id and one of nine fixed strings. The index-building speed on my
> > desktop computer is 40,000 documents/second (tested with 100 million
> > documents).
> >
> > I would suspect that your random text generator is where all the
> > time-intensive processing occurs. Either that or you're flushing after
> > each document addition (which lowers my execution speed to about 100
> > documents/second).
> >
> >


Re: MultiSearcher to overcome the Integer.MAX_VALUE limit

Posted by Mark Miller <ma...@gmail.com>.
Generating random text word by word can often be pretty slow.

I think you will have to modify the MultiSearcher a bit. The 
MultiSearcher takes a global id space and converts to and from an 
individual Searcher id space. The MultiSearcher's id space is limited to 
an int as well, but I think if you change it to a float/double, you 
should be all set.
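
Roughly, the conversion involved looks like the sketch below (illustrative
only, not the actual MultiSearcher internals; here the global id is widened
to a long just to show the idea, and a double works too, since it holds
exact integers up to 2^53):

// Illustrative sketch of a global <-> per-searcher document id mapping.
public class GlobalDocIdMapper {

    private final long[] starts; // starts[i] = first global id of searcher i

    public GlobalDocIdMapper(int[] maxDocPerSearcher) {
        starts = new long[maxDocPerSearcher.length + 1];
        for (int i = 0; i < maxDocPerSearcher.length; i++) {
            starts[i + 1] = starts[i] + maxDocPerSearcher[i];
        }
    }

    /** Which sub-searcher does this global doc id belong to? */
    public int subSearcher(long globalDoc) {
        int i = 0;
        while (i + 1 < starts.length && globalDoc >= starts[i + 1]) {
            i++;
        }
        return i;
    }

    /** Convert a global doc id to that sub-searcher's local (int) doc id. */
    public int subDoc(long globalDoc) {
        return (int) (globalDoc - starts[subSearcher(globalDoc)]);
    }

    /** Convert a local doc id from sub-searcher i back to a global id. */
    public long globalDoc(int subSearcherIndex, int localDoc) {
        return starts[subSearcherIndex] + localDoc;
    }
}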

- Mark

Toke Eskildsen wrote:
> On Fri, 2008-03-07 at 00:03 +0100, Ray wrote:
>    
>> I am currently running a small random text indexer with 400 docs/second.
>> It will reach 2 billion in around 45 days.
>>      
>
> If you are just doing it to test large indexes (in terms of document
> count), then you need to look into your index-generation code. I tried
> making an ultra-simple index builder, where each document contains a
> unique id and one of nine fixed strings. The index-building speed on my
> desktop computer is 40,000 documents/second (tested with 100 million
> documents).
>
> I would suspect that your random text generator is where all the
> time-intensive processing occurs. Either that or you're flushing after
> each document addition (which lowers my execution speed to about 100
> documents/second).
>
>


Re: MultiSearcher to overcome the Integer.MAX_VALUE limit

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Fri, 2008-03-07 at 00:03 +0100, Ray wrote:
> I am currently running a small random text indexer with 400 docs/second.
> It will reach 2 billion in around 45 days.

If you are just doing it to test large indexes (in terms of document
count), then you need to look into your index-generation code. I tried
making an ultra-simple index builder, where each document contains a
unique id and one of nine fixed strings. The index-building speed on my
desktop computer is 40,000 documents/second (tested with 100 million
documents).

I would suspect that your random text generator is where all the
time-intensive processing occurs. Either that or you're flushing after
each document addition (which lowers my execution speed to about 100
documents/second).
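
The builder is essentially a loop like the following (a simplified sketch
against the Lucene 2.x API; the field names, the nine strings, and the path
below are made up for illustration):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SimpleIndexBuilder {

    private static final String[] BODIES = {
        "alpha", "beta", "gamma", "delta", "epsilon",
        "zeta", "eta", "theta", "iota"
    };

    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/bigindex",
                new WhitespaceAnalyzer(), true);
        writer.setMaxBufferedDocs(10000); // let the writer batch docs in RAM

        for (int i = 0; i < 100000000; i++) {
            Document doc = new Document();
            doc.add(new Field("id", Integer.toString(i),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("body", BODIES[i % BODIES.length],
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            // Crucially, no flush() inside the loop -- flushing after every
            // addDocument() is what drops throughput to ~100 docs/second.
        }
        writer.close();
    }
}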




Re: MultiSearcher to overcome the Integer.MAX_VALUE limit

Posted by Ray <ra...@rayweb.de>.
Thanks for your answer.

Well, I want to search around 6 billion documents.
Most of them are very small, but I am confident I will hit
that number in the long run.

I am currently running a small random text indexer with 400 docs/second.
It will reach 2 billion in around 45 days.

I really hope that those of you saying 2 billion docs
will bring Lucene to its knees are wrong...

Ray.

----- Original Message ----- 
From: "Erick Erickson" <er...@gmail.com>
To: <ja...@lucene.apache.org>
Sent: Thursday, March 06, 2008 10:40 PM
Subject: Re: MultiSearcher to overcome the Integer.MAX_VALUE limit


> Well, I'm not sure. But any index, even one split amongst many nodes
> is going to have some interesting performance characteristics if you
> have over 2 billion documents.... So I'm not sure it matters <G>...
> 
> What problem are you really trying to solve? You'll probably get
> more meaningful answers if you tell us what that is.
> 
> Best
> Erick
> 
> On Thu, Mar 6, 2008 at 10:23 AM, Ray <ra...@rayweb.de> wrote:
> 
>> Hey Guys,
>>
>> just a quick question to confirm an assumption I have.
>>
>> Is it correct that I can have around 100 indexes, each at its
>> Integer.MAX_VALUE limit of documents, and still happily search
>> them all with a MultiSearcher, as long as the combined returned
>> hits don't add up to Integer.MAX_VALUE themselves?
>>
>> Kind regards,
>>
>> Ray.
>>
>>
>>
>




Re: MultiSearcher to overcome the Integer.MAX_VALUE limit

Posted by Erick Erickson <er...@gmail.com>.
Well, I'm not sure. But any index, even one split amongst many nodes
is going to have some interesting performance characteristics if you
have over 2 billion documents.... So I'm not sure it matters <G>...

What problem are you really trying to solve? You'll probably get
more meaningful answers if you tell us what that is.

Best
Erick

On Thu, Mar 6, 2008 at 10:23 AM, Ray <ra...@rayweb.de> wrote:

> Hey Guys,
>
> just a quick question to confirm an assumption I have.
>
> Is it correct that I can have around 100 indexes, each at its
> Integer.MAX_VALUE limit of documents, and still happily search
> them all with a MultiSearcher, as long as the combined returned
> hits don't add up to Integer.MAX_VALUE themselves?
>
> Kind regards,
>
> Ray.
>
>
>