You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Arvind Kalyan <ba...@gmail.com> on 2013/10/23 19:13:25 UTC

Merging ordered segments without re-sorting.

Hi there, I'm looking for pointers, suggestions on how to approach this in
Lucene 4.5.

Say I am creating an index using a sequence of addDocument() calls and end
up with segments that each contain documents in a specified ordering. It is
guaranteed that there won't be updates/deletes/reads etc happening on the
index -- this is an offline index building task for a read-only index.

I create the index in the above mentioned fashion
using LogByteSizeMergePolicy and finally do a forceMerge(1) to get a single
segment in the ordering I want.

Now my requirement is that I need to be able to merge this single segment
with another such segment (say from yesterday's index) and guarantee some
ordering -- say I have a comparator which looks at some field values in the
2 given docs and defines the ordering.

Index 1 with segment X:
(a,1)
(b,2)
(e,10)

Index 2 (say from yesterday) with some segment Y:
(c,4)
(d,6)

Essentially we have 2 ordered segments, and I'm looking to 'merge' them
(literally) using the value of some field, without having to re-sort them
which would be too time & resource consuming.

Output Index, with some segment Z:
(a,1)
(b,2)
(c,4)
(d,6)
(e,10)

Is this already possible? If not, any tips on how I can approach
implementing this requirement?

Thanks,

-- 
Arvind Kalyan

Re: Merging ordered segments without re-sorting.

Posted by Adrien Grand <jp...@gmail.com>.
Hi,

On Thu, Oct 24, 2013 at 12:20 AM, Arvind Kalyan <ba...@gmail.com> wrote:
> I will benchmark the available approach itself then, in that case. Will
> revert back if the performance in unacceptable.

For the record, last time I checked, indexing was 2x slower on average
on a 10M document collection (see
https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13605896&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13605896)
and most of the time was not actually spent in sorting the doc IDs but
merging stored fields (because we by-pass the specialized sequential
merging impl which is usually used when merging segment without
sorting).

-- 
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Merging ordered segments without re-sorting.

Posted by Arvind Kalyan <ba...@gmail.com>.
On Wed, Oct 23, 2013 at 2:45 PM, Adrien Grand <jp...@gmail.com> wrote:

> Hi,
>
> On Wed, Oct 23, 2013 at 10:19 PM, Arvind Kalyan <ba...@gmail.com> wrote:
> > Sorting is not an option for our case so we will most likely implement a
> > variant that merges the segments in one pass. Using TimSort is great but
> in
> > our case the 2 segments will be highly interspersed and would not benefit
> > from the galloping in TimSort.
>
> I think Shai didn't mention TimSort because of galloping, but because
> it tries to discover the monotonic sequences in your data. In your
> particular case, this means that the sorting would run in linear time
> instead of O(n log(n)) given that the 2 segments to merge are sorted.
>


OK, I went back and read about TimSort and it does appear to be optimized
for this use-case. Apologize for my ignorance :-)

I will benchmark the available approach itself then, in that case. Will
revert back if the performance in unacceptable.

Thanks.

Re: Merging ordered segments without re-sorting.

Posted by Adrien Grand <jp...@gmail.com>.
Hi,

On Wed, Oct 23, 2013 at 10:19 PM, Arvind Kalyan <ba...@gmail.com> wrote:
> Sorting is not an option for our case so we will most likely implement a
> variant that merges the segments in one pass. Using TimSort is great but in
> our case the 2 segments will be highly interspersed and would not benefit
> from the galloping in TimSort.

I think Shai didn't mention TimSort because of galloping, but because
it tries to discover the monotonic sequences in your data. In your
particular case, this means that the sorting would run in linear time
instead of O(n log(n)) given that the 2 segments to merge are sorted.

> In additional, if anyone else on the list has any inputs on implementing
> the merge (without sort) I'd appreciate it as well! More than likely I'll
> have followup questions if we decide to go this route.

The tricky part is postings merging. You need to be able to remap your
doc ids to their new value in the merged segment and this requires
sorting the doc ids.

-- 
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Merging ordered segments without re-sorting.

Posted by Arvind Kalyan <ba...@gmail.com>.
Thanks again.

Sorting is not an option for our case so we will most likely implement a
variant that merges the segments in one pass. Using TimSort is great but in
our case the 2 segments will be highly interspersed and would not benefit
from the galloping in TimSort.

In additional, if anyone else on the list has any inputs on implementing
the merge (without sort) I'd appreciate it as well! More than likely I'll
have followup questions if we decide to go this route.


On Wed, Oct 23, 2013 at 12:56 PM, Shai Erera <se...@gmail.com> wrote:

> SortingAtomicReader uses the TimSort algorithm, which performs well when
> the two segments are already sorted.
> Anyway, that's the way to do it, even if it looks like it does more work
> than it should.
>
> Shai
>
>
> On Wed, Oct 23, 2013 at 10:46 PM, Arvind Kalyan <ba...@gmail.com> wrote:
>
> > Thanks, my understanding is that SortingMergePolicy performs sorting
> after
> > wrapping the 2 segments, correct?
> >
> > As I mentioned in my original email I would like to avoid the re-sorting
> > and exploit the fact that the input segments are already sorted.
> >
> >
> >
> > On Wed, Oct 23, 2013 at 11:02 AM, Shai Erera <se...@gmail.com> wrote:
> >
> > > Hi
> > >
> > > You can use SortingMergePolicy and SortingAtomicReader to achieve that.
> > You
> > > can read more about index sorting here:
> > > http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html
> > >
> > > Shai
> > >
> > >
> > > On Wed, Oct 23, 2013 at 8:13 PM, Arvind Kalyan <ba...@gmail.com>
> wrote:
> > >
> > > > Hi there, I'm looking for pointers, suggestions on how to approach
> this
> > > in
> > > > Lucene 4.5.
> > > >
> > > > Say I am creating an index using a sequence of addDocument() calls
> and
> > > end
> > > > up with segments that each contain documents in a specified ordering.
> > It
> > > is
> > > > guaranteed that there won't be updates/deletes/reads etc happening on
> > the
> > > > index -- this is an offline index building task for a read-only
> index.
> > > >
> > > > I create the index in the above mentioned fashion
> > > > using LogByteSizeMergePolicy and finally do a forceMerge(1) to get a
> > > single
> > > > segment in the ordering I want.
> > > >
> > > > Now my requirement is that I need to be able to merge this single
> > segment
> > > > with another such segment (say from yesterday's index) and guarantee
> > some
> > > > ordering -- say I have a comparator which looks at some field values
> in
> > > the
> > > > 2 given docs and defines the ordering.
> > > >
> > > > Index 1 with segment X:
> > > > (a,1)
> > > > (b,2)
> > > > (e,10)
> > > >
> > > > Index 2 (say from yesterday) with some segment Y:
> > > > (c,4)
> > > > (d,6)
> > > >
> > > > Essentially we have 2 ordered segments, and I'm looking to 'merge'
> them
> > > > (literally) using the value of some field, without having to re-sort
> > them
> > > > which would be too time & resource consuming.
> > > >
> > > > Output Index, with some segment Z:
> > > > (a,1)
> > > > (b,2)
> > > > (c,4)
> > > > (d,6)
> > > > (e,10)
> > > >
> > > > Is this already possible? If not, any tips on how I can approach
> > > > implementing this requirement?
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Arvind Kalyan
> > > >
> > >
> >
> >
> >
> > --
> > Arvind Kalyan
> > http://www.linkedin.com/in/base16
> > cell: (408) 761-2030
> >
>



-- 
Arvind Kalyan
http://www.linkedin.com/in/base16
cell: (408) 761-2030

Re: Merging ordered segments without re-sorting.

Posted by Shai Erera <se...@gmail.com>.
SortingAtomicReader uses the TimSort algorithm, which performs well when
the two segments are already sorted.
Anyway, that's the way to do it, even if it looks like it does more work
than it should.

Shai


On Wed, Oct 23, 2013 at 10:46 PM, Arvind Kalyan <ba...@gmail.com> wrote:

> Thanks, my understanding is that SortingMergePolicy performs sorting after
> wrapping the 2 segments, correct?
>
> As I mentioned in my original email I would like to avoid the re-sorting
> and exploit the fact that the input segments are already sorted.
>
>
>
> On Wed, Oct 23, 2013 at 11:02 AM, Shai Erera <se...@gmail.com> wrote:
>
> > Hi
> >
> > You can use SortingMergePolicy and SortingAtomicReader to achieve that.
> You
> > can read more about index sorting here:
> > http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html
> >
> > Shai
> >
> >
> > On Wed, Oct 23, 2013 at 8:13 PM, Arvind Kalyan <ba...@gmail.com> wrote:
> >
> > > Hi there, I'm looking for pointers, suggestions on how to approach this
> > in
> > > Lucene 4.5.
> > >
> > > Say I am creating an index using a sequence of addDocument() calls and
> > end
> > > up with segments that each contain documents in a specified ordering.
> It
> > is
> > > guaranteed that there won't be updates/deletes/reads etc happening on
> the
> > > index -- this is an offline index building task for a read-only index.
> > >
> > > I create the index in the above mentioned fashion
> > > using LogByteSizeMergePolicy and finally do a forceMerge(1) to get a
> > single
> > > segment in the ordering I want.
> > >
> > > Now my requirement is that I need to be able to merge this single
> segment
> > > with another such segment (say from yesterday's index) and guarantee
> some
> > > ordering -- say I have a comparator which looks at some field values in
> > the
> > > 2 given docs and defines the ordering.
> > >
> > > Index 1 with segment X:
> > > (a,1)
> > > (b,2)
> > > (e,10)
> > >
> > > Index 2 (say from yesterday) with some segment Y:
> > > (c,4)
> > > (d,6)
> > >
> > > Essentially we have 2 ordered segments, and I'm looking to 'merge' them
> > > (literally) using the value of some field, without having to re-sort
> them
> > > which would be too time & resource consuming.
> > >
> > > Output Index, with some segment Z:
> > > (a,1)
> > > (b,2)
> > > (c,4)
> > > (d,6)
> > > (e,10)
> > >
> > > Is this already possible? If not, any tips on how I can approach
> > > implementing this requirement?
> > >
> > > Thanks,
> > >
> > > --
> > > Arvind Kalyan
> > >
> >
>
>
>
> --
> Arvind Kalyan
> http://www.linkedin.com/in/base16
> cell: (408) 761-2030
>

Re: Merging ordered segments without re-sorting.

Posted by Arvind Kalyan <ba...@gmail.com>.
Thanks, my understanding is that SortingMergePolicy performs sorting after
wrapping the 2 segments, correct?

As I mentioned in my original email I would like to avoid the re-sorting
and exploit the fact that the input segments are already sorted.



On Wed, Oct 23, 2013 at 11:02 AM, Shai Erera <se...@gmail.com> wrote:

> Hi
>
> You can use SortingMergePolicy and SortingAtomicReader to achieve that. You
> can read more about index sorting here:
> http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html
>
> Shai
>
>
> On Wed, Oct 23, 2013 at 8:13 PM, Arvind Kalyan <ba...@gmail.com> wrote:
>
> > Hi there, I'm looking for pointers, suggestions on how to approach this
> in
> > Lucene 4.5.
> >
> > Say I am creating an index using a sequence of addDocument() calls and
> end
> > up with segments that each contain documents in a specified ordering. It
> is
> > guaranteed that there won't be updates/deletes/reads etc happening on the
> > index -- this is an offline index building task for a read-only index.
> >
> > I create the index in the above mentioned fashion
> > using LogByteSizeMergePolicy and finally do a forceMerge(1) to get a
> single
> > segment in the ordering I want.
> >
> > Now my requirement is that I need to be able to merge this single segment
> > with another such segment (say from yesterday's index) and guarantee some
> > ordering -- say I have a comparator which looks at some field values in
> the
> > 2 given docs and defines the ordering.
> >
> > Index 1 with segment X:
> > (a,1)
> > (b,2)
> > (e,10)
> >
> > Index 2 (say from yesterday) with some segment Y:
> > (c,4)
> > (d,6)
> >
> > Essentially we have 2 ordered segments, and I'm looking to 'merge' them
> > (literally) using the value of some field, without having to re-sort them
> > which would be too time & resource consuming.
> >
> > Output Index, with some segment Z:
> > (a,1)
> > (b,2)
> > (c,4)
> > (d,6)
> > (e,10)
> >
> > Is this already possible? If not, any tips on how I can approach
> > implementing this requirement?
> >
> > Thanks,
> >
> > --
> > Arvind Kalyan
> >
>



-- 
Arvind Kalyan
http://www.linkedin.com/in/base16
cell: (408) 761-2030

Re: Merging ordered segments without re-sorting.

Posted by Shai Erera <se...@gmail.com>.
Hi

You can use SortingMergePolicy and SortingAtomicReader to achieve that. You
can read more about index sorting here:
http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html

Shai


On Wed, Oct 23, 2013 at 8:13 PM, Arvind Kalyan <ba...@gmail.com> wrote:

> Hi there, I'm looking for pointers, suggestions on how to approach this in
> Lucene 4.5.
>
> Say I am creating an index using a sequence of addDocument() calls and end
> up with segments that each contain documents in a specified ordering. It is
> guaranteed that there won't be updates/deletes/reads etc happening on the
> index -- this is an offline index building task for a read-only index.
>
> I create the index in the above mentioned fashion
> using LogByteSizeMergePolicy and finally do a forceMerge(1) to get a single
> segment in the ordering I want.
>
> Now my requirement is that I need to be able to merge this single segment
> with another such segment (say from yesterday's index) and guarantee some
> ordering -- say I have a comparator which looks at some field values in the
> 2 given docs and defines the ordering.
>
> Index 1 with segment X:
> (a,1)
> (b,2)
> (e,10)
>
> Index 2 (say from yesterday) with some segment Y:
> (c,4)
> (d,6)
>
> Essentially we have 2 ordered segments, and I'm looking to 'merge' them
> (literally) using the value of some field, without having to re-sort them
> which would be too time & resource consuming.
>
> Output Index, with some segment Z:
> (a,1)
> (b,2)
> (c,4)
> (d,6)
> (e,10)
>
> Is this already possible? If not, any tips on how I can approach
> implementing this requirement?
>
> Thanks,
>
> --
> Arvind Kalyan
>