Posted to dev@lucene.apache.org by Israel Ekpo <is...@gmail.com> on 2010/10/11 07:24:31 UTC

Using long instead of int for docIds

Hi Solr Devs,

I have always had this question at the back of my mind and I would love to
know the answers to a couple of questions.

1. Does using int for document ids place any restrictions on the number of
documents that can be stored in a single index? I am assuming we cannot go
beyond 2^31 - 1 documents, but I have not actually tested this
yet.

2. What would it take to change the core to use long instead of int for
document ids?

3. Would there be any practical gains or benefits of making such a change?

I initially wanted to send this question to the Stomp the Chomp challenge
but I figured it would be better to open it to all.

Any useful feedback will be highly appreciated.

-- 
°O°
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/

Re: Using long instead of int for docIds

Posted by Israel Ekpo <is...@gmail.com>.
Excellent responses, guys.

Thanks a lot for the input.

On Tue, Oct 12, 2010 at 9:13 AM, Yonik Seeley <yo...@lucidimagination.com> wrote:

> On Tue, Oct 12, 2010 at 8:19 AM, eks dev <ek...@yahoo.co.uk> wrote:
> > --- the practical limit for a single lucene index is ~100M docs anyway ---
> >
> > I do not see it that way, there are very practical cases (short documents)
> > with 250M docs and sub-second response times :)
> > And I believe it can be pushed even further, especially when flex branch
> > stabilizes
>
> Of course... I've seen much bigger single indexes myself.
> But I think it's a good ballpark for a "practical" upper bound when
> you need to give one to people w/o further knowledge of their problem.
> People are normally better off sharding at that level, and often far
> below that level (depending on the complexity of the queries,
> faceting, etc.).
>
> -Yonik
> http://www.lucidimagination.com

Re: Using long instead of int for docIds

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, Oct 12, 2010 at 8:19 AM, eks dev <ek...@yahoo.co.uk> wrote:
> --- the practical limit for a single lucene index is ~100M docs anyway ---
>
> I do not see it that way, there are very practical cases (short documents)
> with 250M docs and sub-second response times :)
> And I believe it can be pushed even further, especially when flex branch
> stabilizes

Of course... I've seen much bigger single indexes myself.
But I think it's a good ballpark for a "practical" upper bound when
you need to give one to people w/o further knowledge of their problem.
People are normally better off sharding at that level, and often far
below that level (depending on the complexity of the queries,
faceting, etc.).

-Yonik
http://www.lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
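
As a rough illustration of the sharding advice above, here is a minimal sketch of sizing and routing. It is not Lucene or Solr code: the class `ShardPlanSketch`, its methods, and the hash-based routing are assumptions made for illustration, and the 100M figure is only the ballpark quoted in this thread.

```java
// Hypothetical sketch; none of these names exist in Lucene or Solr.
final class ShardPlanSketch {
    // Assumption: the ~100M docs-per-index ballpark from this thread, not a Lucene constant.
    static final long DOCS_PER_SHARD_BALLPARK = 100_000_000L;

    // Number of shards needed so that no shard exceeds docsPerShard (ceiling division).
    static int shardsNeeded(long totalDocs, long docsPerShard) {
        if (totalDocs < 0 || docsPerShard <= 0) {
            throw new IllegalArgumentException("totalDocs must be >= 0 and docsPerShard > 0");
        }
        long shards = totalDocs / docsPerShard + (totalDocs % docsPerShard == 0 ? 0 : 1);
        // Always keep at least one shard, even for an empty collection.
        return Math.toIntExact(Math.max(1L, shards));
    }

    // Route a document to a shard by hashing its unique key.
    // floorMod keeps the result in [0, numShards) even for negative hash codes.
    static int shardFor(String uniqueKey, int numShards) {
        return Math.floorMod(uniqueKey.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int shards = shardsNeeded(250_000_000L, DOCS_PER_SHARD_BALLPARK);
        System.out.println("shards for 250M docs: " + shards);
        System.out.println("doc-42 goes to shard " + shardFor("doc-42", shards));
    }
}
```

Within each shard, document ids stay well inside the int range, which is why sharding sidesteps the int/long question for most deployments.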


Re: Using long instead of int for docIds

Posted by eks dev <ek...@yahoo.co.uk>.
--- the practical limit for a single lucene index is ~100M docs anyway ---

I do not see it that way; there are very practical cases (short documents)
with 250M docs and sub-second response times :)
And I believe it can be pushed even further, especially when the flex branch
stabilizes.

This changes nothing about your int/long point; I am just doing justice to Lucene.

Cheers,
eks

On Tue, Oct 12, 2010 at 1:01 PM, Israel Ekpo <is...@gmail.com> wrote:

> Thanks Yonik for responding.
>
> This clarifies a lot.
>
> On Mon, Oct 11, 2010 at 11:11 PM, Yonik Seeley <yonik@lucidimagination.com> wrote:
>
>> I think ints instead of longs for docids is still the best practical
>> choice for today.
>> - longs double the size it takes to store collected ids
>> - Java native arrays are indexed by int (hence we couldn't collect
>> more than 2B matches easily anyway)
>> - the practical limit for a single lucene index is ~100M docs anyway
>>
>> But, perhaps MultiSearcher (or a new class called BigMultiSearcher)
>> should start using longs.
>>
>> -Yonik

Re: Using long instead of int for docIds

Posted by Israel Ekpo <is...@gmail.com>.
Thanks Yonik for responding.

This clarifies a lot.

On Mon, Oct 11, 2010 at 11:11 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:

> I think ints instead of longs for docids is still the best practical
> choice for today.
> - longs double the size it takes to store collected ids
> - Java native arrays are indexed by int (hence we couldn't collect
> more than 2B matches easily anyway)
> - the practical limit for a single lucene index is ~100M docs anyway
>
> But, perhaps MultiSearcher (or a new class called BigMultiSearcher)
> should start using longs.
>
> -Yonik

Re: Using long instead of int for docIds

Posted by Yonik Seeley <yo...@lucidimagination.com>.
I think ints instead of longs for docids is still the best practical
choice for today.
- longs double the size it takes to store collected ids
- Java native arrays are indexed by int (hence we couldn't collect
more than 2B matches easily anyway)
- the practical limit for a single lucene index is ~100M docs anyway

But, perhaps MultiSearcher (or a new class called BigMultiSearcher)
should start using longs.

-Yonik
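
The points above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Lucene code: the class `BigDocIdSketch` and all of its methods are hypothetical. It shows how a multi-index searcher could keep int docIds inside each sub-index while exposing long ids globally, using a table of start offsets, and how storing collected ids as long doubles the raw storage compared with int.

```java
import java.util.Arrays;

// Hypothetical sketch; none of these names exist in Lucene or Solr.
final class BigDocIdSketch {
    // starts[i] is the global id of the first document in sub-index i;
    // the final entry is the total document count across all sub-indexes.
    private final long[] starts;

    BigDocIdSketch(int[] subIndexSizes) {
        starts = new long[subIndexSizes.length + 1];
        for (int i = 0; i < subIndexSizes.length; i++) {
            if (subIndexSizes[i] <= 0) {
                throw new IllegalArgumentException("each sub-index must hold at least one document");
            }
            starts[i + 1] = starts[i] + subIndexSizes[i];
        }
    }

    long totalDocs() {
        return starts[starts.length - 1];
    }

    // Per-index int docId -> global long id.
    long toGlobal(int subIndex, int localDocId) {
        return starts[subIndex] + localDocId;
    }

    // Global long id -> owning sub-index, found by binary search over the start offsets.
    int subIndexOf(long globalDocId) {
        if (globalDocId < 0 || globalDocId >= totalDocs()) {
            throw new IllegalArgumentException("global id out of range: " + globalDocId);
        }
        int pos = Arrays.binarySearch(starts, globalDocId);
        // Exact hit: the id is the first document of that sub-index.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the owner is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    // Global long id -> int docId within its sub-index.
    int toLocal(long globalDocId) {
        return (int) (globalDocId - starts[subIndexOf(globalDocId)]);
    }

    // Bytes needed for the raw id values alone, ignoring array headers.
    static long bytesForIds(long count, boolean useLong) {
        return count * (useLong ? Long.BYTES : Integer.BYTES);
    }

    public static void main(String[] args) {
        BigDocIdSketch ids =
            new BigDocIdSketch(new int[] {Integer.MAX_VALUE, Integer.MAX_VALUE, 10});
        long global = ids.toGlobal(2, 7);
        System.out.println("total docs = " + ids.totalDocs());
        System.out.println("global id  = " + global);
        System.out.println("sub-index  = " + ids.subIndexOf(global)
            + ", local = " + ids.toLocal(global));
        System.out.println("100M ids as int:  " + bytesForIds(100_000_000L, false) + " bytes");
        System.out.println("100M ids as long: " + bytesForIds(100_000_000L, true) + " bytes");
    }
}
```

Note that the sketch still cannot hand back more than Integer.MAX_VALUE collected ids in a single Java array, because array lengths and indexes are ints; a long-based searcher would need to page or chunk its results.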
