Posted to dev@lucene.apache.org by Doug Tarr <do...@mongodb.com.INVALID> on 2020/02/07 19:26:51 UTC

Info on document number limitations

Hi!

I'm working on a team that is building a Lucene-based search platform.
I've been lurking on this list for a while as we get up to speed on the
various components of Lucene.  Thank you all for your amazing
work!

I'm interested in learning more about what work has been done around
document count limitations in the Lucene 8 codec (as described here
<http://lucene.apache.org/core/8_0_0/core/org/apache/lucene/codecs/lucene80/package-summary.html>)
related to using int32 vs VInt or Int64:

"Lucene uses a Java int to refer to document numbers, and the index file
format uses an Int32 on-disk to store document numbers. This is a
limitation of both the index file format and the current implementation.
Eventually these should be replaced with either UInt64 values, or better
yet, VInt
<http://lucene.apache.org/core/8_0_0/core/org/apache/lucene/store/DataOutput.html#writeVInt-int->
values
which have no limit."
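(For a concrete picture of the VInt scheme that javadoc mentions, here is a
rough, self-contained sketch in plain Java rather than against Lucene's
DataOutput: each byte carries 7 payload bits, and the high bit is set
whenever more bytes follow, so small values take one byte and there is no
fixed width.)

    import java.io.ByteArrayOutputStream;

    class VIntSketch {
        // Illustrative re-implementation of the VInt idea; Lucene's real
        // writer is org.apache.lucene.store.DataOutput#writeVInt.
        static byte[] writeVInt(int value) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            while ((value & ~0x7F) != 0) {        // more than 7 bits remain
                out.write((value & 0x7F) | 0x80); // set continuation bit
                value >>>= 7;
            }
            out.write(value);                     // last byte, high bit clear
            return out.toByteArray();
        }
    }

    // writeVInt(5)   -> 1 byte
    // writeVInt(300) -> 2 bytes
    // writeVInt(-1)  -> 5 bytes (negative ints are the worst case)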

I've looked through JIRA and couldn't find any discussions about it,
trade-offs, difficulties, etc.  If there's any information about this, I'd
appreciate any links or info that you might have.

Thanks!
- Doug
-- 


{ name     : "Doug Tarr",
  title    : "Director of Engineering, Search",
  location : "San Francisco, CA",
  company  : "MongoDB <http://www.mongodb.com>",
  email:   : "doug.tarr@mongodb.com",
  linkedin : "douglastarr <https://www.linkedin.com/in/douglastarr/>",
  twitter  : "@doug_tarr <https://twitter.com/doug_tarr>" }

Re: Info on document number limitations

Posted by Adrien Grand <jp...@gmail.com>.
Lucene has a limit of 2^31-1-128 documents per index, see
IndexWriter.MAX_DOCS. Users don't often run into this limit but I've seen
it happen multiple times.
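To make that concrete, here is a minimal sketch (mine, not from the thread) of
reading that constant and comparing an existing index against it; the index
path argument is just a placeholder:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class MaxDocsCheck {
        public static void main(String[] args) throws Exception {
            // IndexWriter.MAX_DOCS == Integer.MAX_VALUE - 128 == 2,147,483,519
            System.out.println("Per-index doc limit: " + IndexWriter.MAX_DOCS);

            // Compare against how many doc IDs an existing index already uses.
            try (DirectoryReader reader =
                     DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
                int maxDoc = reader.maxDoc(); // counts deleted docs still holding IDs
                System.out.printf("Using %d of %d doc IDs (%.2f%%)%n",
                    maxDoc, IndexWriter.MAX_DOCS,
                    100.0 * maxDoc / IndexWriter.MAX_DOCS);
            }
        }
    }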

I think it's unlikely that Lucene will ever remove this limit on a
per-segment basis; however, there have been some discussions about adding
the ability to go over this limit across multiple segments:
https://issues.apache.org/jira/browse/LUCENE-8321.

On Sun, Feb 9, 2020 at 2:29 PM Erick Erickson <er...@gmail.com>
wrote:

> Also, given how people use search, they hit performance issues long before
> running out of document IDs. Usually. Although that said I do know of one
> user who’s running in the 1.0-1.5B range per replica so 2B is just around
> the corner. Of course they have to be _very_ careful how they use Solr.
>
> And that said, there’s just not a lot of pressure to go to longs, and as
> Tim says it'd be a very significant effort. And there would be memory
> implications for everyone to balance.
>
> Best,
> Erick
>
> > On Feb 8, 2020, at 9:59 PM, Tim Casey <tc...@gmail.com> wrote:
> >
> >
> > Hi Doug,
> >
> > I don't know the specific limits.  But the document limits are going to
> be around an int, probably signed.  This comes out to mean about 2 billion
> documents per lucene index.  This is fairly embedded into the lucene code.
> The way the collective we have solved this is through forms of sharding.
> >
> > tim
> >
> > On Fri, Feb 7, 2020 at 11:27 AM Doug Tarr <do...@mongodb.com.invalid>
> wrote:
> > Hi!
> >
> > I'm working on a team that is building a lucene based search platform.
>  I've been lurking on this list for a while as we are spooling up on
> learning the various components of Lucene.  Thank you all for your amazing
> work!
> >
> > I'm interested in learning more about what work has been done around
> document count limitations in the Lucene 8 codec (as described here)
> related to using int32 vs VInt or Int64:
> >
> > "Lucene uses a Java int to refer to document numbers, and the index file
> format uses an Int32 on-disk to store document numbers. This is a
> limitation of both the index file format and the current implementation.
> Eventually these should be replaced with either UInt64 values, or better
> yet, VInt values which have no limit."
> >
> > I've looked through JIRA and couldn't find any discussions about it,
> trade-offs, difficulties, etc.  If there's any information about this, I'd
> appreciate any links or info that you might have.
> >
> > Thanks!
> > - Doug
> > --
> >
> > { name     : "Doug Tarr",
> >   title    : "Director of Engineering, Search",
> >   location : "San Francisco, CA",
> >   company  : "MongoDB",
> >   email:   : "doug.tarr@mongodb.com",
> >   linkedin : "douglastarr",
> >   twitter  : "@doug_tarr" }
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

-- 
Adrien

Re: Info on document number limitations

Posted by Erick Erickson <er...@gmail.com>.
Also, given how people use search, they usually hit performance issues long before running out of document IDs. That said, I do know of one user who’s running in the 1.0-1.5B document range per replica, so 2B is just around the corner. Of course, they have to be _very_ careful how they use Solr.

And that said, there’s just not a lot of pressure to go to longs, and as Tim says it’d be a very significant effort. And there would be memory implications for everyone to balance.

Best,
Erick

> On Feb 8, 2020, at 9:59 PM, Tim Casey <tc...@gmail.com> wrote:
> 
> 
> Hi Doug,
> 
> I don't know the specific limits.  But the document limits are going to be around an int, probably signed.  This comes out to mean about 2 billion documents per lucene index.  This is fairly embedded into the lucene code.  The way the collective we have solved this is through forms of sharding.
> 
> tim
> 
> On Fri, Feb 7, 2020 at 11:27 AM Doug Tarr <do...@mongodb.com.invalid> wrote:
> Hi!
> 
> I'm working on a team that is building a lucene based search platform.   I've been lurking on this list for a while as we are spooling up on learning the various components of Lucene.  Thank you all for your amazing work!
> 
> I'm interested in learning more about what work has been done around document count limitations in the Lucene 8 codec (as described here) related to using int32 vs VInt or Int64:
> 
> "Lucene uses a Java int to refer to document numbers, and the index file format uses an Int32 on-disk to store document numbers. This is a limitation of both the index file format and the current implementation. Eventually these should be replaced with either UInt64 values, or better yet, VInt values which have no limit."
> 
> I've looked through JIRA and couldn't find any discussions about it, trade-offs, difficulties, etc.  If there's any information about this, I'd appreciate any links or info that you might have.
> 
> Thanks!
> - Doug
> -- 
> 
> { name     : "Doug Tarr",
>   title    : "Director of Engineering, Search",
>   location : "San Francisco, CA", 
>   company  : "MongoDB",
>   email:   : "doug.tarr@mongodb.com",
>   linkedin : "douglastarr",
>   twitter  : "@doug_tarr" }


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Info on document number limitations

Posted by Tim Casey <tc...@gmail.com>.
Hi Doug,

I don't know the specific limits, but the document limit is going to be
around an int, probably signed.  That comes out to about 2 billion
documents per Lucene index, and the limit is fairly deeply embedded in the
Lucene code.  The way we have collectively solved this is through forms of
sharding.
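
As a minimal sketch of what I mean by sharding (my own illustration, not any
particular product's code): hash each document's ID to one of N independent
Lucene indexes so that no single index approaches the ~2 billion doc ceiling,
and fan searches out across the shards.

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    class ShardRouter {
        private final IndexWriter[] shards; // one IndexWriter per shard directory

        ShardRouter(IndexWriter[] shards) {
            this.shards = shards;
        }

        // Deterministically map an external document ID to a shard.
        int shardFor(String docId) {
            return Math.floorMod(docId.hashCode(), shards.length);
        }

        // Route the document to its shard; updateDocument keeps IDs unique.
        void index(String docId, Document doc) throws IOException {
            doc.add(new StringField("id", docId, Field.Store.YES));
            shards[shardFor(docId)].updateDocument(new Term("id", docId), doc);
        }
    }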

tim

On Fri, Feb 7, 2020 at 11:27 AM Doug Tarr <do...@mongodb.com.invalid>
wrote:

> Hi!
>
> I'm working on a team that is building a lucene based search platform.
>  I've been lurking on this list for a while as we are spooling up on
> learning the various components of Lucene.  Thank you all for your amazing
> work!
>
> I'm interested in learning more about what work has been done around
> document count limitations in the Lucene 8 codec (as described here
> <http://lucene.apache.org/core/8_0_0/core/org/apache/lucene/codecs/lucene80/package-summary.html>)
> related to using int32 vs VInt or Int64:
>
> "Lucene uses a Java int to refer to document numbers, and the index file
> format uses an Int32 on-disk to store document numbers. This is a
> limitation of both the index file format and the current implementation.
> Eventually these should be replaced with either UInt64 values, or better
> yet, VInt
> <http://lucene.apache.org/core/8_0_0/core/org/apache/lucene/store/DataOutput.html#writeVInt-int-> values
> which have no limit."
>
> I've looked through JIRA and couldn't find any discussions about it,
> trade-offs, difficulties, etc.  If there's any information about this, I'd
> appreciate any links or info that you might have.
>
> Thanks!
> - Doug
> --
>
>
> { name     : "Doug Tarr",
>   title    : "Director of Engineering, Search",
>   location : "San Francisco, CA",
>   company  : "MongoDB <http://www.mongodb.com>",
>   email:   : "doug.tarr@mongodb.com",
>   linkedin : "douglastarr <https://www.linkedin.com/in/douglastarr/>",
>   twitter  : "@doug_tarr <https://twitter.com/doug_tarr>" }
>