Posted to java-user@lucene.apache.org by Sriram Sankar <sa...@gmail.com> on 2013/06/14 23:24:44 UTC

segments and sorting

Quick question on segments:

For my use case of having all docs sorted by a static rank and being able
to cut off retrieval after a certain number of docs, I have to sort all my
docs using the static rank (and Lucene 4 has a way to do this).
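
A minimal sketch of the cutoff described above, in plain Java rather than Lucene API calls. The names EarlyTermination, firstMatches and docsInRankOrder are made up for illustration; the only assumption is that doc IDs were assigned in static-rank order, so the first matches encountered are also the best-ranked ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class EarlyTermination {
    /**
     * Walks doc IDs that are already in static-rank order (best first) and
     * stops as soon as `limit` matching docs have been collected.
     */
    static List<Integer> firstMatches(int[] docsInRankOrder, IntPredicate matches, int limit) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : docsInRankOrder) {
            if (hits.size() >= limit) {
                break; // cut off retrieval: later docs can only rank worse
            }
            if (matches.test(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] docs = {0, 3, 4, 7, 9, 12};
        // Hypothetical query: "matches" every doc ID divisible by 3.
        System.out.println(firstMatches(docs, doc -> doc % 3 == 0, 2)); // prints [0, 3]
    }
}
```

The loop never looks at docs 4 through 12 here, which is the whole point of sorting the index by static rank up front.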

When an index has multiple segments, how does this sorting work?  Is each
segment sorted independently?  Or is it possible for me to control this -
and have a single segment?

Assuming I have a single segment, are there any other constraints?  I read
somewhere that FieldValues have a limit of 2 GB per segment - is this true?

Thanks,

Sriram.

Re: segments and sorting

Posted by Sriram Sankar <sa...@gmail.com>.
Thanks.  If I end up doing it, we can try to get it in.

Sriram.


Re: segments and sorting

Posted by Adrien Grand <jp...@gmail.com>.
Hi,

On Wed, Jun 19, 2013 at 12:16 AM, Sriram Sankar <sa...@gmail.com> wrote:
> Is it possible to do this more efficiently using a merge sort?  Assuming
> the individual segments are already sorted, is there a wrapper that I can
> use where I can pass the same sorting function?  I'm guessing the
> SlowCompositeReaderWrapper does not assume that the individual segments are
> already sorted and therefore would repeat the work?

Given that online sorting is rather new to Lucene, we tried to keep it
simple. Merging segments in parallel by maintaining a priority queue
is totally doable and is probably one of the next steps for online
sorting but it would require some non-trivial work to reimplement
merging for all formats (postings lists especially) and to be able to
plug a custom SegmentMerger into the IndexWriter.
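
To illustrate the priority-queue idea only, here is a rough sketch in plain Java. It is not how Lucene's SegmentMerger works and uses no Lucene classes; SortedSegmentMerge, Cursor and merge are hypothetical names, and each segment is modelled as an array of rank values that is assumed to be sorted already.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class SortedSegmentMerge {
    /** Read position inside one already-sorted segment. */
    private static final class Cursor {
        final int segment;
        final long[] ranks;
        int pos;

        Cursor(int segment, long[] ranks) {
            this.segment = segment;
            this.ranks = ranks;
        }
    }

    /**
     * Merges segments whose ranks are each sorted ascending. Returns
     * {segment, localDoc} pairs in merged order; the index of a pair in the
     * returned list would be the doc's ID in the merged segment.
     */
    static List<int[]> merge(long[][] segments) {
        // Order cursors by their current rank; break ties by segment number.
        PriorityQueue<Cursor> queue = new PriorityQueue<>((a, b) -> {
            int byRank = Long.compare(a.ranks[a.pos], b.ranks[b.pos]);
            return byRank != 0 ? byRank : Integer.compare(a.segment, b.segment);
        });
        for (int s = 0; s < segments.length; s++) {
            if (segments[s].length > 0) {
                queue.add(new Cursor(s, segments[s]));
            }
        }
        List<int[]> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();
            merged.add(new int[] {c.segment, c.pos});
            if (++c.pos < c.ranks.length) {
                queue.add(c); // re-insert with its next rank as the key
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        for (int[] doc : merge(new long[][] {{1, 4, 9}, {2, 3, 10}})) {
            System.out.println("segment " + doc[0] + ", doc " + doc[1]);
        }
    }
}
```

Each document is touched once and the queue never holds more than one cursor per segment, so the merge costs O(n log k) for n docs in k segments. The hard part mentioned above is not this loop but applying the resulting order to every index format.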

For now, we just make sure that sorting a SlowCompositeReaderWrapper
that wraps several sorted segments is faster than sorting an arbitrary
AtomicReader: we use TimSort, which takes advantage of runs that are
already sorted, both to compute the mapping between the old and the new
doc IDs and to sort the individual postings lists.
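
As a rough illustration of the doc ID mapping (plain Java, not the actual SortingAtomicReader code): the sketch below sorts old doc IDs by rank with Arrays.sort on an object array, which the JDK implements as a TimSort-style stable merge sort, then inverts the result and remaps one postings list. DocIdMapping and its method names are made up for this example, and the rank array stands in for whatever value the index is sorted on.

```java
import java.util.Arrays;
import java.util.Comparator;

class DocIdMapping {
    /** newToOld[newDoc] = old doc ID, with docs ordered by ascending rank. */
    static int[] newToOld(long[] rankByOldDoc) {
        Integer[] docs = new Integer[rankByOldDoc.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        // Object-array sort: stable and fast on input made of presorted runs,
        // e.g. several already-sorted segments concatenated together.
        Arrays.sort(docs, Comparator.comparingLong((Integer d) -> rankByOldDoc[d]));
        int[] result = new int[docs.length];
        for (int i = 0; i < docs.length; i++) {
            result[i] = docs[i];
        }
        return result;
    }

    /** Inverts the mapping: oldToNew[oldDoc] = new doc ID. */
    static int[] oldToNew(int[] newToOld) {
        int[] result = new int[newToOld.length];
        for (int newDoc = 0; newDoc < newToOld.length; newDoc++) {
            result[newToOld[newDoc]] = newDoc;
        }
        return result;
    }

    /** Rewrites a postings list to new doc IDs and restores ascending order. */
    static int[] remapPostings(int[] postings, int[] oldToNew) {
        int[] out = new int[postings.length];
        for (int i = 0; i < postings.length; i++) {
            out[i] = oldToNew[postings[i]];
        }
        Arrays.sort(out); // primitive sort here, not TimSort
        return out;
    }

    public static void main(String[] args) {
        // Two presorted runs, as if two sorted segments were concatenated.
        long[] ranks = {5, 7, 9, 1, 6, 8};
        int[] oldToNew = oldToNew(newToOld(ranks));
        System.out.println(Arrays.toString(oldToNew));
        System.out.println(Arrays.toString(remapPostings(new int[] {1, 3, 4}, oldToNew)));
    }
}
```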

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: segments and sorting

Posted by Sriram Sankar <sa...@gmail.com>.
> You can sort each segment independently or have a single segment, both
> options are available. To have a single segment, you just need to wrap
> your top-level index reader with SlowCompositeReaderWrapper before
> wrapping it again in a SortingAtomicReader and calling
> IndexWriter.addIndexes.

Is it possible to do this more efficiently using a merge sort?  Assuming
the individual segments are already sorted, is there a wrapper that I can
use where I can pass the same sorting function?  I'm guessing the
SlowCompositeReaderWrapper does not assume that the individual segments are
already sorted and therefore would repeat the work?

Thanks,

Sriram.



Re: segments and sorting

Posted by Adrien Grand <jp...@gmail.com>.
On Tue, Jun 18, 2013 at 1:05 AM, Sriram Sankar <sa...@gmail.com> wrote:
> I'm sorry - I meant "DocValue" not "FieldValue".  Slide 20 in the following
> deck talks about the 2 GB limit.

Doc values don't have this limit anymore. However, there is a limit of
~32 KB per term, but this shouldn't be a problem for reasonable
doc values use cases.
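
If you want to guard against oversized values before indexing, a check along these lines may help. It is only a sketch: DocValueSizeCheck and fits are hypothetical names, the value is assumed to be stored as UTF-8, and the 32766-byte constant is an assumption standing in for the "~32 KB" figure above, so confirm the exact cap for your Lucene version and codec.

```java
import java.nio.charset.StandardCharsets;

class DocValueSizeCheck {
    // Assumption: 32766 bytes stands in for the "~32 KB per term" cap.
    static final int MAX_BYTES = 32766;

    /** True if the UTF-8 encoding of the value stays within MAX_BYTES. */
    static boolean fits(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length <= MAX_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(fits("a short value"));
    }
}
```

Note that the limit is in bytes, not characters, so text with many non-ASCII characters reaches it sooner.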

These slides describe the pre-4.0 API, and the doc values API was
completely refactored in 4.2. Although the concepts are the same,
translating the code examples to the new API may be non-trivial.

--
Adrien



Re: segments and sorting

Posted by Sriram Sankar <sa...@gmail.com>.
I'm sorry - I meant "DocValue" not "FieldValue".  Slide 20 in the following
deck talks about the 2 GB limit.

Sriram.

http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene


Re: segments and sorting

Posted by Adrien Grand <jp...@gmail.com>.
Hi,

On Fri, Jun 14, 2013 at 11:24 PM, Sriram Sankar <sa...@gmail.com> wrote:
> For my use case of having all docs sorted by a static rank and being able
> to cut off retrieval after a certain number of docs, I have to sort all my
> docs using the static rank (and Lucene 4 has a way to do this).
>
> When an index has multiple segments, how does this sorting work?  Is each
> segment sorted independently?  Or is it possible for me to control this -
> and have a single segment?

You can either sort each segment independently or have a single segment;
both options are available. To get a single segment, wrap your
top-level index reader with SlowCompositeReaderWrapper, wrap the result
in a SortingAtomicReader, and pass that to IndexWriter.addIndexes.

> Assuming I have a single segment, are there any other constraints?  I read
> somewhere that FieldValues have a limit of 2 GB per segment - is this true?

What do you mean by "FieldValue"? If you are referring to stored
fields, a single field value cannot be larger than 2 GB because the API
uses ints. Some codecs enforce lower limits, though: for example, the
current default stored fields format requires that the sum of the
sizes of all fields of a _single_ document be less than 2 GB (which is
already much more than typical users need). I think the major
limitation is that a single Lucene index cannot have more than about 2
billion documents, since doc IDs are ints, but you can split your data
across several physical shards to work around this limitation and merge
results at search time.
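
The sharding workaround could look roughly like the sketch below. It is plain Java with hypothetical names (ShardMerge, Hit, shardsNeeded, mergeTopK) rather than any Lucene API, and it uses Integer.MAX_VALUE as a stand-in for the per-index document ceiling; the real cap may be slightly lower.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ShardMerge {
    /** One search hit as returned by a single shard. */
    static final class Hit {
        final int shard;
        final int doc;
        final double score;

        Hit(int shard, int doc, double score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    /** Minimum number of shards if each holds at most Integer.MAX_VALUE docs. */
    static long shardsNeeded(long totalDocs) {
        long perShard = Integer.MAX_VALUE; // assumption: ceiling implied by int doc IDs
        return (totalDocs + perShard - 1) / perShard;
    }

    /** Combines each shard's hits and keeps the k highest-scoring ones. */
    static List<Hit> mergeTopK(List<List<Hit>> perShard, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShard) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        List<List<Hit>> perShard = new ArrayList<>();
        perShard.add(List.of(new Hit(0, 4, 0.9), new Hit(0, 7, 0.4)));
        perShard.add(List.of(new Hit(1, 2, 0.7), new Hit(1, 9, 0.1)));
        for (Hit h : mergeTopK(perShard, 3)) {
            System.out.println("shard " + h.shard + ", doc " + h.doc + ", score " + h.score);
        }
        System.out.println(shardsNeeded(5_000_000_000L) + " shards for 5 billion docs");
    }
}
```

Because doc IDs are only unique within a shard, each merged hit has to carry its shard number along with the doc ID.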

--
Adrien
