You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Ravikumar Govindarajan <ra...@gmail.com> on 2015/04/14 14:07:34 UTC

SortingAtomicReader alternate to Tim-Sort...

We were experimenting with SortingMergePolicy and came across an alternate
solution to TimSort of postings-list using FBS & GrowableWriter.

I have attached relevant code-snippet. It would be nice if someone can
clarify whether it is a good idea to implement...

public class SortingAtomicReader {
…
…
class SortingDocsEnum {

//Last 2 variables namely *newdoclist* & *olddocToFreq* are added in
//constructor. It is assumed that these 2 variables are init during
//merge start & they are then re-used till merge completes...


public SortingDocsEnum(int maxDoc, final DocsEnum in, boolean withFreqs,
final Sorter.DocMap docMap, FixedBitSet newdoclist, GrowableWriter
olddocToFreq) throws IOException {

….

…

while (true) {

  //Instead of Tim-Sorting as in existing code

  doc = in.nextDoc();

  int newdoc = docMap.oldToNew(doc);

  newdoclist.set(newdoc);

  if(withFreqs) {

    olddocToFreq.set(doc, in.freq());

  }

}


@Override

public int nextDoc() throws IOException {

  if (++docIt >= upto) {

  return NO_MORE_DOCS;

  }

  currDoc = newdoclist.nextSetBit(++currDoc);

  if(currDoc == -1) {

    return NO_MORE_DOCS;

  }

  //clear the set-bit here before returning...

  newdoclist.clear(currDoc);

  return currDoc;

}


@Override

public int freq() throws IOException {

  if(withFreqs && docIt < upto) {

  return (int)olddocToFreq.getMutable()

                 .get(docMap.newToOld(currDoc));

  }

  return 1;

}

}

Re: SortingAtomicReader alternate to Tim-Sort...

Posted by Ravikumar Govindarajan <ra...@gmail.com>.

Thanks. Glad that it has been pro-actively identified and fixed

--
Ravi

On Thu, Apr 23, 2015 at 10:34 AM, Robert Muir <rc...@gmail.com> wrote:

> On Tue, Apr 21, 2015 at 4:00 AM, Ravikumar Govindarajan
> <ra...@gmail.com> wrote:
>
> > b) CompressingStoredFieldsReader did not store the last decoded 32KB
> chunk.
> > Our segments are already sorted before participating in a merge. On
> mostly
> > linear merge, we ended up decoding the same chunk again and again. Simply
> > storing the last chunk resulted in good speed-ups for us...
>
> See also https://issues.apache.org/jira/browse/LUCENE-6131 where this
> is solved for 5.0.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: SortingAtomicReader alternate to Tim-Sort...

Posted by Robert Muir <rc...@gmail.com>.

On Tue, Apr 21, 2015 at 4:00 AM, Ravikumar Govindarajan
<ra...@gmail.com> wrote:

> b) CompressingStoredFieldsReader did not store the last decoded 32KB chunk.
> Our segments are already sorted before participating in a merge. On mostly
> linear merge, we ended up decoding the same chunk again and again. Simply
> storing the last chunk resulted in good speed-ups for us...

See also https://issues.apache.org/jira/browse/LUCENE-6131 where this
is solved for 5.0.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: SortingAtomicReader alternate to Tim-Sort...

Posted by Adrien Grand <jp...@gmail.com>.

Sorry for the delay, I opened
https://issues.apache.org/jira/browse/LUCENE-6469. It can go to trunk
and 5.x (the value of x depending on when it's ready :)).

On Thu, Apr 30, 2015 at 9:02 AM, Ravikumar Govindarajan
<ra...@gmail.com> wrote:
>>
>> Would you like to submit a patch that changes SortingMergePolicy to
>> use the approach that you are proposing using bitsets instead of
>> sorting int[] arrays?
>
>
> Sure can do that. Can you open a ticket for this, as I don't know what
> versions this can go in?
>
> --
> Ravi
>
>
>
> On Tue, Apr 28, 2015 at 6:03 PM, Adrien Grand <jp...@gmail.com> wrote:
>
>> On Tue, Apr 21, 2015 at 10:00 AM, Ravikumar Govindarajan
>> <ra...@gmail.com> wrote:
>> > Thanks for the comments…
>> >
>> > My only
>> >> concern about using the FixedBitSet is that it would make sorting each
>> >> postings list run in O(maxDoc) but maybe we can make it better by
>> >> using SparseFixedBitSet
>> >
>> >
>> > Yes I was also thinking about this. But we are on 4.x and did not take
>> the
>> > plunge. But as you said, it should be a good idea to test on SFBS
>>
>> Would you like to submit a patch that changes SortingMergePolicy to
>> use the approach that you are proposing using bitsets instead of
>> sorting int[] arrays?
>>
>> > I'm curious if you already performed any kind of benchmarking of this
>> >> approach?
>> >
>> >
>> > Yes we did a stress test of sorts aimed at SortingMergePolicy. We made
>> most
>> > of our data as RAM resident and then CPU hot-spots came up...
>> >
>> > There were few take-aways from the test. I am listing down few of them..
>> > It's kind of lengthy. Please read through...
>> >
>> > a) Postings-List issue, as discussed above…
>> >
>> > b) CompressingStoredFieldsReader did not store the last decoded 32KB
>> chunk.
>> > Our segments are already sorted before participating in a merge. On
>> mostly
>> > linear merge, we ended up decoding the same chunk again and again. Simply
>> > storing the last chunk resulted in good speed-ups for us...
>> >
>> > c) Once above steps were corrected, the CPU hotspot shifted to
>> > BlockDocsEnum. Here most of our postings-list < 128 docs. So
>> > Lucene41Postings started using vInts…  I did try ForUtil encoding even
>> for
>> > < 128 docs. It definitely went easy on CPU. But failed to measure
>> resulting
>> > file-size increase.
>> >
>> > I realised not just SMP but any other merge must face the same issue and
>> > left it at that..
>>
>> True. Like Robert said, there has been work done on b) already and I
>> think we can move forward on a) too. Thanks for sharing your findings!
>>
>> --
>> Adrien
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>



-- 
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: SortingAtomicReader alternate to Tim-Sort...

Posted by Ravikumar Govindarajan <ra...@gmail.com>.

>
> Would you like to submit a patch that changes SortingMergePolicy to
> use the approach that you are proposing using bitsets instead of
> sorting int[] arrays?


Sure can do that. Can you open a ticket for this, as I don't know what
versions this can go in?

--
Ravi



On Tue, Apr 28, 2015 at 6:03 PM, Adrien Grand <jp...@gmail.com> wrote:

> On Tue, Apr 21, 2015 at 10:00 AM, Ravikumar Govindarajan
> <ra...@gmail.com> wrote:
> > Thanks for the comments…
> >
> > My only
> >> concern about using the FixedBitSet is that it would make sorting each
> >> postings list run in O(maxDoc) but maybe we can make it better by
> >> using SparseFixedBitSet
> >
> >
> > Yes I was also thinking about this. But we are on 4.x and did not take
> the
> > plunge. But as you said, it should be a good idea to test on SFBS
>
> Would you like to submit a patch that changes SortingMergePolicy to
> use the approach that you are proposing using bitsets instead of
> sorting int[] arrays?
>
> > I'm curious if you already performed any kind of benchmarking of this
> >> approach?
> >
> >
> > Yes we did a stress test of sorts aimed at SortingMergePolicy. We made
> most
> > of our data as RAM resident and then CPU hot-spots came up...
> >
> > There were few take-aways from the test. I am listing down few of them..
> > It's kind of lengthy. Please read through...
> >
> > a) Postings-List issue, as discussed above…
> >
> > b) CompressingStoredFieldsReader did not store the last decoded 32KB
> chunk.
> > Our segments are already sorted before participating in a merge. On
> mostly
> > linear merge, we ended up decoding the same chunk again and again. Simply
> > storing the last chunk resulted in good speed-ups for us...
> >
> > c) Once above steps were corrected, the CPU hotspot shifted to
> > BlockDocsEnum. Here most of our postings-list < 128 docs. So
> > Lucene41Postings started using vInts…  I did try ForUtil encoding even
> for
> > < 128 docs. It definitely went easy on CPU. But failed to measure
> resulting
> > file-size increase.
> >
> > I realised not just SMP but any other merge must face the same issue and
> > left it at that..
>
> True. Like Robert said, there has been work done on b) already and I
> think we can move forward on a) too. Thanks for sharing your findings!
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: SortingAtomicReader alternate to Tim-Sort...

Posted by Adrien Grand <jp...@gmail.com>.

On Tue, Apr 21, 2015 at 10:00 AM, Ravikumar Govindarajan
<ra...@gmail.com> wrote:
> Thanks for the comments…
>
> My only
>> concern about using the FixedBitSet is that it would make sorting each
>> postings list run in O(maxDoc) but maybe we can make it better by
>> using SparseFixedBitSet
>
>
> Yes I was also thinking about this. But we are on 4.x and did not take the
> plunge. But as you said, it should be a good idea to test on SFBS

Would you like to submit a patch that changes SortingMergePolicy to
use the approach that you are proposing using bitsets instead of
sorting int[] arrays?

> I'm curious if you already performed any kind of benchmarking of this
>> approach?
>
>
> Yes we did a stress test of sorts aimed at SortingMergePolicy. We made most
> of our data as RAM resident and then CPU hot-spots came up...
>
> There were few take-aways from the test. I am listing down few of them..
> It's kind of lengthy. Please read through...
>
> a) Postings-List issue, as discussed above…
>
> b) CompressingStoredFieldsReader did not store the last decoded 32KB chunk.
> Our segments are already sorted before participating in a merge. On mostly
> linear merge, we ended up decoding the same chunk again and again. Simply
> storing the last chunk resulted in good speed-ups for us...
>
> c) Once above steps were corrected, the CPU hotspot shifted to
> BlockDocsEnum. Here most of our postings-list < 128 docs. So
> Lucene41Postings started using vInts…  I did try ForUtil encoding even for
> < 128 docs. It definitely went easy on CPU. But failed to measure resulting
> file-size increase.
>
> I realised not just SMP but any other merge must face the same issue and
> left it at that..

True. Like Robert said, there has been work done on b) already and I
think we can move forward on a) too. Thanks for sharing your findings!

-- 
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: SortingAtomicReader alternate to Tim-Sort...

Posted by Ravikumar Govindarajan <ra...@gmail.com>.

Thanks for the comments…

My only
> concern about using the FixedBitSet is that it would make sorting each
> postings list run in O(maxDoc) but maybe we can make it better by
> using SparseFixedBitSet


Yes I was also thinking about this. But we are on 4.x and did not take the
plunge. But as you said, it should be a good idea to test on SFBS

I'm curious if you already performed any kind of benchmarking of this
> approach?


Yes we did a stress test of sorts aimed at SortingMergePolicy. We made most
of our data as RAM resident and then CPU hot-spots came up...

There were few take-aways from the test. I am listing down few of them..
It's kind of lengthy. Please read through...

a) Postings-List issue, as discussed above…

b) CompressingStoredFieldsReader did not store the last decoded 32KB chunk.
Our segments are already sorted before participating in a merge. On mostly
linear merge, we ended up decoding the same chunk again and again. Simply
storing the last chunk resulted in good speed-ups for us...

c) Once above steps were corrected, the CPU hotspot shifted to
BlockDocsEnum. Here most of our postings-list < 128 docs. So
Lucene41Postings started using vInts…  I did try ForUtil encoding even for
< 128 docs. It definitely went easy on CPU. But failed to measure resulting
file-size increase.

I realised not just SMP but any other merge must face the same issue and
left it at that..

--
Ravi

On Mon, Apr 20, 2015 at 12:56 PM, Adrien Grand <jp...@gmail.com> wrote:

> I like these ideas, the int[] we are using today are wasteful. My only
> concern about using the FixedBitSet is that it would make sorting each
> postings list run in O(maxDoc) but maybe we can make it better by
> using SparseFixedBitSet (added in 5.0, given your code snippets I
> assume you are still on 4.x)?
>
> I'm curious if you already performed any kind of benchmarking of this
> approach?
>
>
> On Tue, Apr 14, 2015 at 2:07 PM, Ravikumar Govindarajan
> <ra...@gmail.com> wrote:
> > We were experimenting with SortingMergePolicy and came across an
> alternate
> > solution to TimSort of postings-list using FBS & GrowableWriter.
> >
> > I have attached relevant code-snippet. It would be nice if someone can
> > clarify whether it is a good idea to implement...
> >
> > public class SortingAtomicReader {
> > …
> > …
> > class SortingDocsEnum {
> >
> > //Last 2 variables namely *newdoclist* & *olddocToFreq* are added in
> > //constructor. It is assumed that these 2 variables are init during
> > //merge start & they are then re-used till merge completes...
> >
> >
> > public SortingDocsEnum(int maxDoc, final DocsEnum in, boolean withFreqs,
> > final Sorter.DocMap docMap, FixedBitSet newdoclist, GrowableWriter
> > olddocToFreq) throws IOException {
> >
> > ….
> >
> > …
> >
> > while (true) {
> >
> >   //Instead of Tim-Sorting as in existing code
> >
> >   doc = in.nextDoc();
> >
> >   int newdoc = docMap.oldToNew(doc);
> >
> >   newdoclist.set(newdoc);
> >
> >   if(withFreqs) {
> >
> >     olddocToFreq.set(doc, in.freq());
> >
> >   }
> >
> > }
> >
> >
> > @Override
> >
> > public int nextDoc() throws IOException {
> >
> >   if (++docIt >= upto) {
> >
> >   return NO_MORE_DOCS;
> >
> >   }
> >
> >   currDoc = newdoclist.nextSetBit(++currDoc);
> >
> >   if(currDoc == -1) {
> >
> >     return NO_MORE_DOCS;
> >
> >   }
> >
> >   //clear the set-bit here before returning...
> >
> >   newdoclist.clear(currDoc);
> >
> >   return currDoc;
> >
> > }
> >
> >
> > @Override
> >
> > public int freq() throws IOException {
> >
> >   if(withFreqs && docIt < upto) {
> >
> >   return (int)olddocToFreq.getMutable()
> >
> >                  .get(docMap.newToOld(currDoc));
> >
> >   }
> >
> >   return 1;
> >
> > }
> >
> > }
>
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: SortingAtomicReader alternate to Tim-Sort...

Posted by Adrien Grand <jp...@gmail.com>.

I like these ideas, the int[] we are using today are wasteful. My only
concern about using the FixedBitSet is that it would make sorting each
postings list run in O(maxDoc) but maybe we can make it better by
using SparseFixedBitSet (added in 5.0, given your code snippets I
assume you are still on 4.x)?

I'm curious if you already performed any kind of benchmarking of this approach?


On Tue, Apr 14, 2015 at 2:07 PM, Ravikumar Govindarajan
<ra...@gmail.com> wrote:
> We were experimenting with SortingMergePolicy and came across an alternate
> solution to TimSort of postings-list using FBS & GrowableWriter.
>
> I have attached relevant code-snippet. It would be nice if someone can
> clarify whether it is a good idea to implement...
>
> public class SortingAtomicReader {
> …
> …
> class SortingDocsEnum {
>
> //Last 2 variables namely *newdoclist* & *olddocToFreq* are added in
> //constructor. It is assumed that these 2 variables are init during
> //merge start & they are then re-used till merge completes...
>
>
> public SortingDocsEnum(int maxDoc, final DocsEnum in, boolean withFreqs,
> final Sorter.DocMap docMap, FixedBitSet newdoclist, GrowableWriter
> olddocToFreq) throws IOException {
>
> ….
>
> …
>
> while (true) {
>
>   //Instead of Tim-Sorting as in existing code
>
>   doc = in.nextDoc();
>
>   int newdoc = docMap.oldToNew(doc);
>
>   newdoclist.set(newdoc);
>
>   if(withFreqs) {
>
>     olddocToFreq.set(doc, in.freq());
>
>   }
>
> }
>
>
> @Override
>
> public int nextDoc() throws IOException {
>
>   if (++docIt >= upto) {
>
>   return NO_MORE_DOCS;
>
>   }
>
>   currDoc = newdoclist.nextSetBit(++currDoc);
>
>   if(currDoc == -1) {
>
>     return NO_MORE_DOCS;
>
>   }
>
>   //clear the set-bit here before returning...
>
>   newdoclist.clear(currDoc);
>
>   return currDoc;
>
> }
>
>
> @Override
>
> public int freq() throws IOException {
>
>   if(withFreqs && docIt < upto) {
>
>   return (int)olddocToFreq.getMutable()
>
>                  .get(docMap.newToOld(currDoc));
>
>   }
>
>   return 1;
>
> }
>
> }



-- 
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org