Posted to java-user@lucene.apache.org by Ravikumar Govindarajan <ra...@gmail.com> on 2014/02/06 11:16:35 UTC

Actual min and max-value of NumericField during codec flush

I use a Codec to flush data. All methods delegate to the actual Lucene42Codec,
except for intercepting one single field. This field is indexed as an
IntField [Numeric-Trie...], with precisionStep=4.

The purpose of the Codec is as follows

1. Note the first BytesRef for this field
2. During finish() call [TermsConsumer.java], note the last BytesRef for
this field
3. Convert both the first/last BytesRef to their respective integers
4. Store these 2 ints in segment-info diagnostics

The problem with this approach is that the first/last BytesRef is totally
different from the actual "int" values I try to index. I guess this is
because the numeric trie expands all the integers into its own format of
BytesRefs. Hence my Codec stores the wrong values in the segment diagnostics.

Is there a way I can record actual min/max int-values correctly in my codec
and still support NumericRange search?
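
A minimal sketch of the min/max tracking I have in mind, assuming Lucene 4.x's
NumericUtils for decoding full-precision trie terms (the class below is
illustrative glue, not my actual codec):

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.NumericUtils;

    // Tracks the true min/max int values as terms stream through the
    // TermsConsumer for the intercepted field. Only shift==0 terms encode
    // the exact document values; higher-shift trie terms are skipped.
    final class IntMinMaxTracker {
      private Integer min;
      private Integer max;

      void observeTerm(BytesRef term) {
        if (NumericUtils.getPrefixCodedIntShift(term) != 0) {
          return;                                   // prefix (trie) term, ignore
        }
        int value = NumericUtils.prefixCodedToInt(term);
        if (min == null) {
          min = value;    // terms arrive in sorted order, so the first
        }                 // full-precision term is the minimum...
        max = value;      // ...and the last one seen is the maximum
      }

      Integer min() { return min; }
      Integer max() { return max; }
    }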

--
Ravi

Re: Actual min and max-value of NumericField during codec flush

Posted by Ravikumar Govindarajan <ra...@gmail.com>.
Thanks Mike for your time and help



Re: Actual min and max-value of NumericField during codec flush

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Mon, Feb 17, 2014 at 8:33 AM, Ravikumar Govindarajan
<ra...@gmail.com> wrote:
>>
>> Well, this will change your scores?  MultiReader will sum up all term
>> statistics across all SegmentReaders "up front", and then scoring per
>> segment will use those top-level weights.
>
>
> Our app needs to do only matching and sorting. In fact, it would be fully
> OK to bypass scoring. But I feel scoring must be so blazing fast that there
> would be no gain in avoiding it. Can you please confirm if this is the
> case?

You should avoid it if in fact you don't use it.  What are you sorting
by?  If you sort by field, and don't ask for scores, then scores won't
be computed.
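
For example, a sort-only search could look like this (a sketch against the
Lucene 4.x API; the field name is an assumption):

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;

    final class SortOnlySearch {
      // Newest-first sort on a numeric "timestamp" field. This search()
      // overload fills in the sort fields but does not compute scores.
      static TopDocs latestFirst(IndexSearcher searcher, Query query, int n) throws IOException {
        Sort byTimeDesc = new Sort(new SortField("timestamp", SortField.Type.LONG, true));
        return searcher.search(query, n, byTimeDesc);
      }
    }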

>> Which addIndexes method are you using?  The one taking Directory[]
>> does file-level copies, assigning sequential segment names (but this
>> is not guaranteed), and the one taking IndexReader[] merges all the
>> incoming indices into a single segment.
>
>
> I am planning to use the IndexReader[] to merge out-of-order segments,
> which makes timestamp-based merging easier.

OK.

>> You may need to just impl a custom MergePolicy that sorts all segments in
>> the index by timestamp and picks the merge order accordingly...
>
>
> Yes, this is what I think I will do, with a SortingMP wrapper. I hope
> merges will work fine, after accumulating considerable data over a period
> of time.

OK good luck and have fun :)

Mike McCandless

http://blog.mikemccandless.com



Re: Actual min and max-value of NumericField during codec flush

Posted by Ravikumar Govindarajan <ra...@gmail.com>.
>
> Well, this will change your scores?  MultiReader will sum up all term
> statistics across all SegmentReaders "up front", and then scoring per
> segment will use those top-level weights.


Our app needs to do only matching and sorting. In fact, it would be fully
OK to bypass scoring. But I feel scoring must be so blazing fast that there
would be no gain in avoiding it. Can you please confirm if this is the
case?

> Which addIndexes method are you using?  The one taking Directory[]
> does file-level copies, assigning sequential segment names (but this
> is not guaranteed), and the one taking IndexReader[] merges all the
> incoming indices into a single segment.


I am planning to use the IndexReader[] to merge out-of-order segments,
which makes timestamp-based merging easier.

> You may need to just impl a custom MergePolicy that sorts all segments in
> the index by timestamp and picks the merge order accordingly...


Yes, this is what I think I will do, with a SortingMP wrapper. I hope
merges will work fine, after accumulating considerable data over a period
of time.

--
Ravi

Re: Actual min and max-value of NumericField during codec flush

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Fri, Feb 14, 2014 at 12:14 AM, Ravikumar Govindarajan
<ra...@gmail.com> wrote:

> Early-query termination quits by throwing an exception, right? Is it OK to
> individually search each SegmentReader and then break off, instead of
> using a MultiReader, especially when the order is known before the search
> begins?

Well, this will change your scores?  MultiReader will sum up all term
statistics across all SegmentReaders "up front", and then scoring per
segment will use those top-level weights.

> The reason why I insisted on timestamp-based merging is that there is
> a possibility of an out-of-order segment added via an addIndexes(...) call.
> That segment can have a much older timestamp [a month ago, a year ago, etc.],
> albeit extremely rarely. Should I worry about it during merges, or just
> handle overlaps during search?

Which addIndexes method are you using?  The one taking Directory[]
does file-level copies, assigning sequential segment names (but this
is not guaranteed), and the one taking IndexReader[] merges all the
incoming indices into a single segment.
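
For reference, a rough sketch of the two overloads (Lucene 4.x IndexWriter
API; illustrative only):

    import java.io.IOException;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    final class AddIndexesSketch {
      // Directory flavor: segment files are copied over as-is.
      static void copySegments(IndexWriter writer, Directory other) throws IOException {
        writer.addIndexes(other);
      }

      // IndexReader flavor: the incoming index is merged into one new segment.
      static void mergeSegments(IndexWriter writer, Directory other) throws IOException {
        DirectoryReader reader = DirectoryReader.open(other);
        try {
          writer.addIndexes(reader);
        } finally {
          reader.close();
        }
      }
    }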

You may need to just impl a custom MergePolicy that sorts all segments
in the index by timestamp and picks the merge order accordingly...

Mike McCandless

http://blog.mikemccandless.com



Re: Actual min and max-value of NumericField during codec flush

Posted by Ravikumar Govindarajan <ra...@gmail.com>.
Yeah, now I understand a little better.

Since LogMP always merges adjacent segments, that should pretty much serve
my use case when used with a SortingMP.

Early-query termination quits by throwing an exception, right? Is it OK to
individually search each SegmentReader and then break off, instead of
using a MultiReader, especially when the order is known before the search
begins?
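
(For what it's worth, a sketch of the exception-based termination I mean,
assuming Lucene 4.x's Collector API; IndexSearcher catches
CollectionTerminatedException per segment and simply continues with the
next leaf. The collector below is illustrative only:)

    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.CollectionTerminatedException;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // Collects at most 'limit' hits per segment, then tells IndexSearcher to
    // stop collecting this segment and move on to the next one.
    final class FirstNPerSegmentCollector extends Collector {
      private final int limit;
      private int collectedInLeaf;

      FirstNPerSegmentCollector(int limit) { this.limit = limit; }

      @Override public void setScorer(Scorer scorer) {}                   // scores unused
      @Override public boolean acceptsDocsOutOfOrder() { return false; }  // docs in order

      @Override public void setNextReader(AtomicReaderContext context) {
        collectedInLeaf = 0;   // reset counter for the new segment
      }

      @Override public void collect(int doc) {
        // ... record the hit here (doc is segment-relative) ...
        if (++collectedInLeaf >= limit) {
          throw new CollectionTerminatedException();  // terminates this segment only
        }
      }
    }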

The reason why I insisted on timestamp-based merging is that there is
a possibility of an out-of-order segment added via an addIndexes(...) call.
That segment can have a much older timestamp [a month ago, a year ago, etc.],
albeit extremely rarely. Should I worry about it during merges, or just
handle overlaps during search?

--
Ravi



On Thu, Feb 13, 2014 at 1:21 PM, Shai Erera <se...@gmail.com> wrote:

> Hi
>
> LogMP *always* picks adjacent segments together. Therefore, if you have
> segments S1, S2, S3, S4 where the date-wise sort order is S4>S3>S2>S1, then
> LogMP will pick either S1-S4, S2-S4, S2-S3 and so on. But always adjacent
> segments and in a raw (i.e. it doesn't skip segments).
>
> I guess what both Mike and I don't understand is why you insist on merging
> based on the timestamp of each segment. I.e. if the order, timestamp-wise,
> of the segments isn't as I described above, then merging them like so won't
> hurt - i.e. they will still be unsorted. No harm is done.
>
> Maybe MergePolicy isn't what you need here. If you can record somewhere the
> min/max timestamp of each segment, you can use a MultiReader to wrap the
> sorted list of IndexReaders (actually SegmentReaders). Then your "reader",
> always traverses segments from new to old.
>
> If this approach won't address your issue, then you can merge based on
> timestamps - there's nothing wrong about it. What Mike suggested is that
> you benchmark your application with this merge policy, for a long period of
> time (few hours/days, depending on your indexing rate), because what might
> happen is that your merges are always unbalanced and your indexing
> performance will degrade because of unbalanced amount of IO that happens
> during the merge.
>
> Shai
>
>
> On Thu, Feb 13, 2014 at 7:25 AM, Ravikumar Govindarajan <
> ravikumar.govindarajan@gmail.com> wrote:
>
> > @Mike,
> >
> > I had suggested the same approach in one of my previous mails, where-by
> > each segment records min/max timestamps in seg-info diagnostics and use
> it
> > for merging adjacent segments.
> >
> > "Then, I define a TimeMergePolicy extends LogMergePolicy and define the
> > segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag]. "
> >
> > But you have expressed reservations
> >
> > "This seems somewhat dangerous...
> >
> > Not taking into account the "true" segment size can lead to very very
> > poor merge decisions ... you should turn on IndexWriter's infoStream
> > and do a long running test to convince yourself the merging is being
> > sane."
> >
> > Will merging be disastrous, if I choose a TimeMergePolicy? I will also
> test
> > and verify, but it's always great to hear finer points from experts.
> >
> > @Shai,
> >
> > LogByteSizeMP categorizes "adjacency" by "size", whereas it would be
> better
> > if "timestamp" is used in my case
> >
> > Sure, I need to wrap this in an SMP to make sure that the newly-created
> > segment is also in sorted-order
> >
> > --
> > Ravi
> >
> >
> >
> > On Wed, Feb 12, 2014 at 8:29 PM, Shai Erera <se...@gmail.com> wrote:
> >
> > > Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks
> > adjacent
> > > segments and SortingMP ensures the merged segment is also sorted.
> > >
> > > Shai
> > >
> > >
> > > On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan <
> > > ravikumar.govindarajan@gmail.com> wrote:
> > >
> > > > Yes exactly as you have described.
> > > >
> > > > Ex: Consider Segment[S1,S2,S3 & S4] are in reverse-chronological
> order
> > > and
> > > > goes for a merge
> > > >
> > > > While SortingMergePolicy will correctly solve the merge-part, it does
> > not
> > > > however play any role in picking segments to merge right?
> > > >
> > > > SMP internally delegates to TieredMergePolicy, which might pick S1&S4
> > to
> > > > merge disturbing the global-order. Ideally only "adjacent" segments
> > > should
> > > > be picked up for merge. Ex: {S1,S2} or {S2,S3,S4} etc...
> > > >
> > > > Can there be a better selection of segments to merge in this case, so
> > as
> > > to
> > > > maintain a semblance of global-ordering?
> > > >
> > > > --
> > > > Ravi
> > > >
> > > >
> > > >
> > > > On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless <
> > > > lucene@mikemccandless.com> wrote:
> > > >
> > > > > OK, I see (early termination).
> > > > >
> > > > > That's a challenge, because you really want the docs sorted
> backwards
> > > > > from how they were added right?  And, e.g., merged and then
> searched
> > > > > in "reverse segment order"?
> > > > >
> > > > > I think you should be able to do this w/ SortingMergePolicy?  And
> > then
> > > > > use a custom collector that stops after you've gone back enough in
> > > > > time for a given search.
> > > > >
> > > > > Mike McCandless
> > > > >
> > > > > http://blog.mikemccandless.com
> > > > >
> > > > >
> > > > > On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan
> > > > > <ra...@gmail.com> wrote:
> > > > > > Mike,
> > > > > >
> > > > > > All our queries need to be sorted by timestamp field, in
> descending
> > > > order
> > > > > > of time. [latest-first]
> > > > > >
> > > > > > Each segment is sorted in itself. But TieredMergePolicy picks
> > > arbitrary
> > > > > > segments and merges them [even with SortingMergePolicy etc...]. I
> > am
> > > > > trying
> > > > > > to avoid this and see if an approximate global ordering of
> segments
> > > [by
> > > > > > time-stamp field] can be maintained via merge.
> > > > > >
> > > > > > Ex: TopN results will only examine recent 2-3 smaller segments
> > > > > [best-case]
> > > > > > and return, without examining older and bigger segments.
> > > > > >
> > > > > > I do not know the terminology, may be "Early Query Termination
> > Across
> > > > > > Segments" etc...?
> > > > > >
> > > > > > --
> > > > > > Ravi
> > > > > >
> > > > > >
> > > > > > On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless <
> > > > > > lucene@mikemccandless.com> wrote:
> > > > > >
> > > > > >> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the
> > > total
> > > > > >> order.
> > > > > >>
> > > > > >> Only TieredMergePolicy merges out-of-order segments.
> > > > > >>
> > > > > >> I don't understand why you need to encouraging merging of the
> more
> > > > > >> recent (by your "time" field) segments...
> > > > > >>
> > > > > >> Mike McCandless
> > > > > >>
> > > > > >> http://blog.mikemccandless.com
> > > > > >>
> > > > > >>
> > > > > >> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
> > > > > >> <ra...@gmail.com> wrote:
> > > > > >> > Mike,
> > > > > >> >
> > > > > >> > Each of my flushed segment is fully ordered by time. But
> > > > > >> TieredMergePolicy
> > > > > >> > or LogByteSizeMergePolicy is going to pick arbitrary
> > time-segments
> > > > and
> > > > > >> > disturb this arrangement and I wanted some kind of control on
> > > this.
> > > > > >> >
> > > > > >> > But like you pointed-out, going by only be time-adjacent
> merges
> > > can
> > > > be
> > > > > >> > disastrous.
> > > > > >> >
> > > > > >> > Is there a way to mix both time and size to arrive at a
> somewhat
> > > > > >> > [less-than-accurate] global order of segment merges.
> > > > > >> >
> > > > > >> > Like attempt a time-adjacent merge, provided size of segments
> is
> > > not
> > > > > >> > extremely skewed etc...
> > > > > >> >
> > > > > >> > --
> > > > > >> > Ravi
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
> > > > > >> > lucene@mikemccandless.com> wrote:
> > > > > >> >
> > > > > >> >> You want to focus merging on the segments containing newer
> > > > documents?
> > > > > >> >> Why?  This seems somewhat dangerous...
> > > > > >> >>
> > > > > >> >> Not taking into account the "true" segment size can lead to
> > very
> > > > very
> > > > > >> >> poor merge decisions ... you should turn on IndexWriter's
> > > > infoStream
> > > > > >> >> and do a long running test to convince yourself the merging
> is
> > > > being
> > > > > >> >> sane.
> > > > > >> >>
> > > > > >> >> Mike
> > > > > >> >>
> > > > > >> >> Mike McCandless
> > > > > >> >>
> > > > > >> >> http://blog.mikemccandless.com
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
> > > > > >> >> <ra...@gmail.com> wrote:
> > > > > >> >> > Thanks Mike,
> > > > > >> >> >
> > > > > >> >> > Will try your suggestion. I will try to describe the actual
> > > > > use-case
> > > > > >> >> itself
> > > > > >> >> >
> > > > > >> >> > There is a requirement for merging time-adjacent segments
> > > > > >> [append-only,
> > > > > >> >> > rolling time-series data]
> > > > > >> >> >
> > > > > >> >> > All Documents have a timestamp affixed and during flush I
> > need
> > > to
> > > > > note
> > > > > >> >> down
> > > > > >> >> > the least timestamp for all documents, through Codec.
> > > > > >> >> >
> > > > > >> >> > Then, I define a TimeMergePolicy extends LogMergePolicy and
> > > > define
> > > > > the
> > > > > >> >> > segment-size=Long.MAX_VALUE - SEG_LEAST_TIME
> [segment-diag].
> > > > > >> >> >
> > > > > >> >> > LogMergePolicy will auto-arrange levels of segments
> according
> > > > time
> > > > > and
> > > > > >> >> > proceed with merges. Latest segments will be lesser in size
> > and
> > > > > >> preferred
> > > > > >> >> > during merges than older and bigger segments
> > > > > >> >> >
> > > > > >> >> > Do you think such an approach will be fine or there are
> > better
> > > > > ways to
> > > > > >> >> > solve this?
> > > > > >> >> >
> > > > > >> >> > --
> > > > > >> >> > Ravi
> > > > > >> >> >
> > > > > >> >> >
> > > > > >> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
> > > > > >> >> > lucene@mikemccandless.com> wrote:
> > > > > >> >> >
> > > > > >> >> >> Somewhere in those numeric trie terms are the exact
> integers
> > > > from
> > > > > >> your
> > > > > >> >> >> documents, encoded.
> > > > > >> >> >>
> > > > > >> >> >> You can use oal.util.NumericUtils.prefixCodedToInt to get
> > the
> > > > int
> > > > > >> >> >> value back from the BytesRef term.
> > > > > >> >> >>
> > > > > >> >> >> But you need to filter out the "higher level" terms, e.g.
> > > using
> > > > > >> >> >> NumericUtils.getPrefixCodedLongShift(term) == 0.  Or use
> > > > > >> >> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum.
>  I
> > > > > believe
> > > > > >> >> >> all the terms you want come first, so once you hit a term
> > > where
> > > > > >> >> >> .getPrefixCodedLongShift is > 0, that's your max term and
> > you
> > > > can
> > > > > >> stop
> > > > > >> >> >> checking.
> > > > > >> >> >>
> > > > > >> >> >> BTW, in 5.0, the codec API for PostingsFormat has
> improved,
> > so
> > > > > that
> > > > > >> >> >> you can e.g. pull your own TermsEnum and iterate the terms
> > > > > yourself.
> > > > > >> >> >>
> > > > > >> >> >> Mike McCandless
> > > > > >> >> >>
> > > > > >> >> >> http://blog.mikemccandless.com
> > > > > >> >> >>
> > > > > >> >> >>
> > > > > >> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
> > > > > >> >> >> <ra...@gmail.com> wrote:
> > > > > >> >> >> > I use a Codec to flush data. All methods delegate to
> > actual
> > > > > >> >> >> Lucene42Codec,
> > > > > >> >> >> > except for intercepting one single-field. This field is
> > > > indexed
> > > > > as
> > > > > >> an
> > > > > >> >> >> > IntField [Numeric-Trie...], with precisionStep=4.
> > > > > >> >> >> >
> > > > > >> >> >> > The purpose of the Codec is as follows
> > > > > >> >> >> >
> > > > > >> >> >> > 1. Note the first BytesRef for this field
> > > > > >> >> >> > 2. During finish() call [TermsConsumer.java], note the
> > last
> > > > > >> BytesRef
> > > > > >> >> for
> > > > > >> >> >> > this field
> > > > > >> >> >> > 3. Converts both the first/last BytesRef to respective
> > > > integers
> > > > > >> >> >> > 4. Store these 2 ints in segment-info diagnostics
> > > > > >> >> >> >
> > > > > >> >> >> > The problem with this approach is that, first/last
> > BytesRef
> > > is
> > > > > >> totally
> > > > > >> >> >> > different from the actual "int" values I try to index. I
> > > > guess,
> > > > > >> this
> > > > > >> >> is
> > > > > >> >> >> > because Numeric-Trie explodes all the integers into it's
> > own
> > > > > >> format of
> > > > > >> >> >> > BytesRefs. Hence my Codec stores the wrong values in
> > > > > >> >> segment-diagnostics
> > > > > >> >> >> >
> > > > > >> >> >> > Is there a way I can record actual min/max int-values
> > > > correctly
> > > > > in
> > > > > >> my
> > > > > >> >> >> codec
> > > > > >> >> >> > and still support NumericRange search?
> > > > > >> >> >> >
> > > > > >> >> >> > --
> > > > > >> >> >> > Ravi

Re: Actual min and max-value of NumericField during codec flush

Posted by Shai Erera <se...@gmail.com>.
Hi

LogMP *always* picks adjacent segments together. Therefore, if you have
segments S1, S2, S3, S4 where the date-wise sort order is S4>S3>S2>S1, then
LogMP will pick either S1-S4, S2-S4, S2-S3 and so on. But it always picks
adjacent segments, in a row (i.e. it doesn't skip segments).

I guess what both Mike and I don't understand is why you insist on merging
based on the timestamp of each segment. I.e. if the order, timestamp-wise,
of the segments isn't as I described above, then merging them like so won't
hurt - i.e. they will still be unsorted. No harm is done.

Maybe MergePolicy isn't what you need here. If you can record somewhere the
min/max timestamp of each segment, you can use a MultiReader to wrap the
sorted list of IndexReaders (actually SegmentReaders). Then your "reader"
always traverses segments from new to old.
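
Roughly something like this (a sketch, assuming Lucene 4.x; the max-timestamp
lookup is a stand-in for however you recorded it, e.g. segment diagnostics -
it is not a Lucene API):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.function.ToLongFunction;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;

    final class NewestFirstReader {
      // maxTimestampOf is the caller's way of recovering the per-segment max
      // timestamp that was recorded at flush time (an assumption here).
      static IndexReader wrap(DirectoryReader reader,
                              ToLongFunction<IndexReader> maxTimestampOf) throws IOException {
        List<AtomicReaderContext> leaves = new ArrayList<>(reader.leaves());
        // newest segment first
        leaves.sort(Comparator.comparingLong(
            (AtomicReaderContext ctx) -> maxTimestampOf.applyAsLong(ctx.reader())).reversed());
        IndexReader[] ordered = new IndexReader[leaves.size()];
        for (int i = 0; i < leaves.size(); i++) {
          ordered[i] = leaves.get(i).reader();
        }
        // closeSubReaders=false: the segment readers still belong to 'reader'
        return new MultiReader(ordered, false);
      }
    }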

If this approach won't address your issue, then you can merge based on
timestamps - there's nothing wrong with it. What Mike suggested is that
you benchmark your application with this merge policy for a long period of
time (a few hours/days, depending on your indexing rate), because what might
happen is that your merges are always unbalanced, and your indexing
performance will degrade because of the unbalanced amount of IO that happens
during the merges.

Shai



Re: Actual min and max-value of NumericField during codec flush

Posted by Ravikumar Govindarajan <ra...@gmail.com>.
@Mike,

I had suggested the same approach in one of my previous mails, whereby
each segment records min/max timestamps in its seg-info diagnostics and uses
them for merging adjacent segments.

"Then, I define a TimeMergePolicy extends LogMergePolicy and define the
segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag]. "

But you have expressed reservations

"This seems somewhat dangerous...

Not taking into account the "true" segment size can lead to very very
poor merge decisions ... you should turn on IndexWriter's infoStream
and do a long running test to convince yourself the merging is being
sane."

Will merging be disastrous if I choose a TimeMergePolicy? I will also test
and verify, but it's always great to hear the finer points from experts.

@Shai,

LogByteSizeMP categorizes "adjacency" by "size", whereas it would be better
if "timestamp" were used in my case.

Sure, I need to wrap this in an SMP to make sure that the newly created
segment is also in sorted order.

--
Ravi




Re: Actual min and max-value of NumericField during codec flush

Posted by Shai Erera <se...@gmail.com>.
Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks adjacent
segments and SortingMP ensures the merged segment is also sorted.

Shai



Re: Actual min and max-value of NumericField during codec flush

Posted by Michael McCandless <lu...@mikemccandless.com>.
Right, I think you'll need to use one of the LogXMergePolicies (or
subclass LogMergePolicy and make your own): they always pick adjacent
segments to merge.

SortingMP lets you pass in the MP to wrap, so just pass in a LogXMP,
and then sort by timestamp?
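
Something along these lines (a sketch; SortingMergePolicy lives in the misc
module and its exact constructor may differ between 4.x releases, and the
"timestamp" field name is an assumption, so treat this as approximate):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.LogByteSizeMergePolicy;
    import org.apache.lucene.index.sorter.SortingMergePolicy;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.util.Version;

    final class TimeSortedMergeConfig {
      // LogByteSizeMP picks adjacent segments; SortingMP re-sorts each merged
      // segment by the timestamp field, newest first.
      static IndexWriterConfig newConfig() {
        Sort byTimeDesc = new Sort(new SortField("timestamp", SortField.Type.LONG, true));
        IndexWriterConfig iwc = new IndexWriterConfig(
            Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
        iwc.setMergePolicy(new SortingMergePolicy(new LogByteSizeMergePolicy(), byTimeDesc));
        return iwc;
      }
    }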

Mike McCandless

http://blog.mikemccandless.com


>> >> >> >
>> >> >> >> Somewhere in those numeric trie terms are the exact integers from
>> >> your
>> >> >> >> documents, encoded.
>> >> >> >>
>> >> >> >> You can use oal.util.NumericUtils.prefixCodecToInt to get the int
>> >> >> >> value back from the BytesRef term.
>> >> >> >>
>> >> >> >> But you need to filter out the "higher level" terms, e.g. using
>> >> >> >> NumericUtils.getPrefixCodedLongShift(term) == 0.  Or use
>> >> >> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum.  I
>> believe
>> >> >> >> all the terms you want come first, so once you hit a term where
>> >> >> >> .getPrefixCodedLongShift is > 0, that's your max term and you can
>> >> stop
>> >> >> >> checking.
>> >> >> >>
>> >> >> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so
>> that
>> >> >> >> you can e.g. pull your own TermsEnum and iterate the terms
>> yourself.
>> >> >> >>
>> >> >> >> Mike McCandless
>> >> >> >>
>> >> >> >> http://blog.mikemccandless.com
>> >> >> >>
>> >> >> >>
>> >> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
>> >> >> >> <ra...@gmail.com> wrote:
>> >> >> >> > I use a Codec to flush data. All methods delegate to actual
>> >> >> >> Lucene42Codec,
>> >> >> >> > except for intercepting one single-field. This field is indexed
>> as
>> >> an
>> >> >> >> > IntField [Numeric-Trie...], with precisionStep=4.
>> >> >> >> >
>> >> >> >> > The purpose of the Codec is as follows
>> >> >> >> >
>> >> >> >> > 1. Note the first BytesRef for this field
>> >> >> >> > 2. During finish() call [TermsConsumer.java], note the last
>> >> BytesRef
>> >> >> for
>> >> >> >> > this field
>> >> >> >> > 3. Converts both the first/last BytesRef to respective integers
>> >> >> >> > 4. Store these 2 ints in segment-info diagnostics
>> >> >> >> >
>> >> >> >> > The problem with this approach is that, first/last BytesRef is
>> >> totally
>> >> >> >> > different from the actual "int" values I try to index. I guess,
>> >> this
>> >> >> is
>> >> >> >> > because Numeric-Trie explodes all the integers into it's own
>> >> format of
>> >> >> >> > BytesRefs. Hence my Codec stores the wrong values in
>> >> >> segment-diagnostics
>> >> >> >> >
>> >> >> >> > Is there a way I can record actual min/max int-values correctly
>> in
>> >> my
>> >> >> >> codec
>> >> >> >> > and still support NumericRange search?
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > Ravi
>> >> >> >>
>> >> >> >>
>> ---------------------------------------------------------------------
>> >> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >> >> >>
>> >> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >> >>
>> >> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Actual min and max-value of NumericField during codec flush

Posted by Ravikumar Govindarajan <ra...@gmail.com>.
Yes exactly as you have described.

Ex: Consider segments [S1, S2, S3 & S4] that are in reverse-chronological
order and go for a merge.

While SortingMergePolicy will correctly solve the merge part, it does not,
however, play any role in picking segments to merge, right?

SMP internally delegates to TieredMergePolicy, which might pick S1 & S4 to
merge, disturbing the global order. Ideally only "adjacent" segments should
be picked up for merge. Ex: {S1,S2} or {S2,S3,S4} etc...

Can there be a better selection of segments to merge in this case, so as to
maintain a semblance of global ordering?

--
Ravi



On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> OK, I see (early termination).
>
> That's a challenge, because you really want the docs sorted backwards
> from how they were added right?  And, e.g., merged and then searched
> in "reverse segment order"?
>
> I think you should be able to do this w/ SortingMergePolicy?  And then
> use a custom collector that stops after you've gone back enough in
> time for a given search.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan
> <ra...@gmail.com> wrote:
> > Mike,
> >
> > All our queries need to be sorted by timestamp field, in descending order
> > of time. [latest-first]
> >
> > Each segment is sorted in itself. But TieredMergePolicy picks arbitrary
> > segments and merges them [even with SortingMergePolicy etc...]. I am
> trying
> > to avoid this and see if an approximate global ordering of segments [by
> > time-stamp field] can be maintained via merge.
> >
> > Ex: TopN results will only examine recent 2-3 smaller segments
> [best-case]
> > and return, without examining older and bigger segments.
> >
> > I do not know the terminology, may be "Early Query Termination Across
> > Segments" etc...?
> >
> > --
> > Ravi
> >
> >
> > On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless <
> > lucene@mikemccandless.com> wrote:
> >
> >> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total
> >> order.
> >>
> >> Only TieredMergePolicy merges out-of-order segments.
> >>
> >> I don't understand why you need to encouraging merging of the more
> >> recent (by your "time" field) segments...
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
> >> <ra...@gmail.com> wrote:
> >> > Mike,
> >> >
> >> > Each of my flushed segment is fully ordered by time. But
> >> TieredMergePolicy
> >> > or LogByteSizeMergePolicy is going to pick arbitrary time-segments and
> >> > disturb this arrangement and I wanted some kind of control on this.
> >> >
> >> > But like you pointed-out, going by only be time-adjacent merges can be
> >> > disastrous.
> >> >
> >> > Is there a way to mix both time and size to arrive at a somewhat
> >> > [less-than-accurate] global order of segment merges.
> >> >
> >> > Like attempt a time-adjacent merge, provided size of segments is not
> >> > extremely skewed etc...
> >> >
> >> > --
> >> > Ravi
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
> >> > lucene@mikemccandless.com> wrote:
> >> >
> >> >> You want to focus merging on the segments containing newer documents?
> >> >> Why?  This seems somewhat dangerous...
> >> >>
> >> >> Not taking into account the "true" segment size can lead to very very
> >> >> poor merge decisions ... you should turn on IndexWriter's infoStream
> >> >> and do a long running test to convince yourself the merging is being
> >> >> sane.
> >> >>
> >> >> Mike
> >> >>
> >> >> Mike McCandless
> >> >>
> >> >> http://blog.mikemccandless.com
> >> >>
> >> >>
> >> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
> >> >> <ra...@gmail.com> wrote:
> >> >> > Thanks Mike,
> >> >> >
> >> >> > Will try your suggestion. I will try to describe the actual
> use-case
> >> >> itself
> >> >> >
> >> >> > There is a requirement for merging time-adjacent segments
> >> [append-only,
> >> >> > rolling time-series data]
> >> >> >
> >> >> > All Documents have a timestamp affixed and during flush I need to
> note
> >> >> down
> >> >> > the least timestamp for all documents, through Codec.
> >> >> >
> >> >> > Then, I define a TimeMergePolicy extends LogMergePolicy and define
> the
> >> >> > segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].
> >> >> >
> >> >> > LogMergePolicy will auto-arrange levels of segments according time
> and
> >> >> > proceed with merges. Latest segments will be lesser in size and
> >> preferred
> >> >> > during merges than older and bigger segments
> >> >> >
> >> >> > Do you think such an approach will be fine or there are better
> ways to
> >> >> > solve this?
> >> >> >
> >> >> > --
> >> >> > Ravi
> >> >> >
> >> >> >
> >> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
> >> >> > lucene@mikemccandless.com> wrote:
> >> >> >
> >> >> >> Somewhere in those numeric trie terms are the exact integers from
> >> your
> >> >> >> documents, encoded.
> >> >> >>
> >> >> >> You can use oal.util.NumericUtils.prefixCodecToInt to get the int
> >> >> >> value back from the BytesRef term.
> >> >> >>
> >> >> >> But you need to filter out the "higher level" terms, e.g. using
> >> >> >> NumericUtils.getPrefixCodedLongShift(term) == 0.  Or use
> >> >> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum.  I
> believe
> >> >> >> all the terms you want come first, so once you hit a term where
> >> >> >> .getPrefixCodedLongShift is > 0, that's your max term and you can
> >> stop
> >> >> >> checking.
> >> >> >>
> >> >> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so
> that
> >> >> >> you can e.g. pull your own TermsEnum and iterate the terms
> yourself.
> >> >> >>
> >> >> >> Mike McCandless
> >> >> >>
> >> >> >> http://blog.mikemccandless.com
> >> >> >>
> >> >> >>
> >> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
> >> >> >> <ra...@gmail.com> wrote:
> >> >> >> > I use a Codec to flush data. All methods delegate to actual
> >> >> >> Lucene42Codec,
> >> >> >> > except for intercepting one single-field. This field is indexed
> as
> >> an
> >> >> >> > IntField [Numeric-Trie...], with precisionStep=4.
> >> >> >> >
> >> >> >> > The purpose of the Codec is as follows
> >> >> >> >
> >> >> >> > 1. Note the first BytesRef for this field
> >> >> >> > 2. During finish() call [TermsConsumer.java], note the last
> >> BytesRef
> >> >> for
> >> >> >> > this field
> >> >> >> > 3. Converts both the first/last BytesRef to respective integers
> >> >> >> > 4. Store these 2 ints in segment-info diagnostics
> >> >> >> >
> >> >> >> > The problem with this approach is that, first/last BytesRef is
> >> totally
> >> >> >> > different from the actual "int" values I try to index. I guess,
> >> this
> >> >> is
> >> >> >> > because Numeric-Trie explodes all the integers into it's own
> >> format of
> >> >> >> > BytesRefs. Hence my Codec stores the wrong values in
> >> >> segment-diagnostics
> >> >> >> >
> >> >> >> > Is there a way I can record actual min/max int-values correctly
> in
> >> my
> >> >> >> codec
> >> >> >> > and still support NumericRange search?
> >> >> >> >
> >> >> >> > --
> >> >> >> > Ravi
> >> >> >>
> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >> >>
> >> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >>
> >> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Actual min and max-value of NumericField during codec flush

Posted by Michael McCandless <lu...@mikemccandless.com>.
OK, I see (early termination).

That's a challenge, because you really want the docs sorted backwards
from how they were added, right?  And, e.g., merged and then searched
in "reverse segment order"?

I think you should be able to do this w/ SortingMergePolicy?  And then
use a custom collector that stops after you've gone back enough in
time for a given search.
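 
Roughly like this (only a sketch: the "timestamp" field name, the cutoff and
the wrapped collector are assumptions, and it relies on each segment being
sorted newest-first so that once a doc is older than the cutoff the rest of
that segment can be skipped):

    import java.io.IOException;

    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.CollectionTerminatedException;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.Scorer;

    // Stops collecting a segment once docs get older than "cutoff".
    // IndexSearcher catches CollectionTerminatedException and simply moves
    // on to the next segment.
    class TimeCutoffCollector extends Collector {
      private final Collector delegate;   // e.g. a TopFieldCollector
      private final long cutoff;          // oldest timestamp still of interest
      private FieldCache.Longs timestamps;

      TimeCutoffCollector(Collector delegate, long cutoff) {
        this.delegate = delegate;
        this.cutoff = cutoff;
      }

      @Override
      public void setScorer(Scorer scorer) throws IOException {
        delegate.setScorer(scorer);
      }

      @Override
      public void setNextReader(AtomicReaderContext context) throws IOException {
        // assumes a numeric "timestamp" field that FieldCache can uninvert
        timestamps = FieldCache.DEFAULT.getLongs(context.reader(), "timestamp", false);
        delegate.setNextReader(context);
      }

      @Override
      public void collect(int doc) throws IOException {
        if (timestamps.get(doc) < cutoff) {
          throw new CollectionTerminatedException(); // rest of segment is older
        }
        delegate.collect(doc);
      }

      @Override
      public boolean acceptsDocsOutOfOrder() {
        return false; // depends on within-segment newest-first order
      }
    }

The delegate would typically be a TopFieldCollector sorting by the same
timestamp field.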

Mike McCandless

http://blog.mikemccandless.com


On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan
<ra...@gmail.com> wrote:
> Mike,
>
> All our queries need to be sorted by timestamp field, in descending order
> of time. [latest-first]
>
> Each segment is sorted in itself. But TieredMergePolicy picks arbitrary
> segments and merges them [even with SortingMergePolicy etc...]. I am trying
> to avoid this and see if an approximate global ordering of segments [by
> time-stamp field] can be maintained via merge.
>
> Ex: TopN results will only examine recent 2-3 smaller segments [best-case]
> and return, without examining older and bigger segments.
>
> I do not know the terminology, may be "Early Query Termination Across
> Segments" etc...?
>
> --
> Ravi
>
>
> On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total
>> order.
>>
>> Only TieredMergePolicy merges out-of-order segments.
>>
>> I don't understand why you need to encouraging merging of the more
>> recent (by your "time" field) segments...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
>> <ra...@gmail.com> wrote:
>> > Mike,
>> >
>> > Each of my flushed segment is fully ordered by time. But
>> TieredMergePolicy
>> > or LogByteSizeMergePolicy is going to pick arbitrary time-segments and
>> > disturb this arrangement and I wanted some kind of control on this.
>> >
>> > But like you pointed-out, going by only be time-adjacent merges can be
>> > disastrous.
>> >
>> > Is there a way to mix both time and size to arrive at a somewhat
>> > [less-than-accurate] global order of segment merges.
>> >
>> > Like attempt a time-adjacent merge, provided size of segments is not
>> > extremely skewed etc...
>> >
>> > --
>> > Ravi
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
>> > lucene@mikemccandless.com> wrote:
>> >
>> >> You want to focus merging on the segments containing newer documents?
>> >> Why?  This seems somewhat dangerous...
>> >>
>> >> Not taking into account the "true" segment size can lead to very very
>> >> poor merge decisions ... you should turn on IndexWriter's infoStream
>> >> and do a long running test to convince yourself the merging is being
>> >> sane.
>> >>
>> >> Mike
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >>
>> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
>> >> <ra...@gmail.com> wrote:
>> >> > Thanks Mike,
>> >> >
>> >> > Will try your suggestion. I will try to describe the actual use-case
>> >> itself
>> >> >
>> >> > There is a requirement for merging time-adjacent segments
>> [append-only,
>> >> > rolling time-series data]
>> >> >
>> >> > All Documents have a timestamp affixed and during flush I need to note
>> >> down
>> >> > the least timestamp for all documents, through Codec.
>> >> >
>> >> > Then, I define a TimeMergePolicy extends LogMergePolicy and define the
>> >> > segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].
>> >> >
>> >> > LogMergePolicy will auto-arrange levels of segments according time and
>> >> > proceed with merges. Latest segments will be lesser in size and
>> preferred
>> >> > during merges than older and bigger segments
>> >> >
>> >> > Do you think such an approach will be fine or there are better ways to
>> >> > solve this?
>> >> >
>> >> > --
>> >> > Ravi
>> >> >
>> >> >
>> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
>> >> > lucene@mikemccandless.com> wrote:
>> >> >
>> >> >> Somewhere in those numeric trie terms are the exact integers from
>> your
>> >> >> documents, encoded.
>> >> >>
>> >> >> You can use oal.util.NumericUtils.prefixCodecToInt to get the int
>> >> >> value back from the BytesRef term.
>> >> >>
>> >> >> But you need to filter out the "higher level" terms, e.g. using
>> >> >> NumericUtils.getPrefixCodedLongShift(term) == 0.  Or use
>> >> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum.  I believe
>> >> >> all the terms you want come first, so once you hit a term where
>> >> >> .getPrefixCodedLongShift is > 0, that's your max term and you can
>> stop
>> >> >> checking.
>> >> >>
>> >> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so that
>> >> >> you can e.g. pull your own TermsEnum and iterate the terms yourself.
>> >> >>
>> >> >> Mike McCandless
>> >> >>
>> >> >> http://blog.mikemccandless.com
>> >> >>
>> >> >>
>> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
>> >> >> <ra...@gmail.com> wrote:
>> >> >> > I use a Codec to flush data. All methods delegate to actual
>> >> >> Lucene42Codec,
>> >> >> > except for intercepting one single-field. This field is indexed as
>> an
>> >> >> > IntField [Numeric-Trie...], with precisionStep=4.
>> >> >> >
>> >> >> > The purpose of the Codec is as follows
>> >> >> >
>> >> >> > 1. Note the first BytesRef for this field
>> >> >> > 2. During finish() call [TermsConsumer.java], note the last
>> BytesRef
>> >> for
>> >> >> > this field
>> >> >> > 3. Converts both the first/last BytesRef to respective integers
>> >> >> > 4. Store these 2 ints in segment-info diagnostics
>> >> >> >
>> >> >> > The problem with this approach is that, first/last BytesRef is
>> totally
>> >> >> > different from the actual "int" values I try to index. I guess,
>> this
>> >> is
>> >> >> > because Numeric-Trie explodes all the integers into it's own
>> format of
>> >> >> > BytesRefs. Hence my Codec stores the wrong values in
>> >> segment-diagnostics
>> >> >> >
>> >> >> > Is there a way I can record actual min/max int-values correctly in
>> my
>> >> >> codec
>> >> >> > and still support NumericRange search?
>> >> >> >
>> >> >> > --
>> >> >> > Ravi
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >> >>
>> >> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Actual min and max-value of NumericField during codec flush

Posted by Ravikumar Govindarajan <ra...@gmail.com>.
Mike,

All our queries need to be sorted by the timestamp field, in descending order
of time [latest-first].

Each segment is sorted within itself. But TieredMergePolicy picks arbitrary
segments and merges them [even with SortingMergePolicy etc...]. I am trying
to avoid this and see if an approximate global ordering of segments [by
timestamp field] can be maintained via merge.

Ex: TopN results will only examine the 2-3 most recent, smaller segments
[best-case] and return, without examining older and bigger segments.

I do not know the terminology; maybe "Early Query Termination Across
Segments" etc...?

--
Ravi


On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total
> order.
>
> Only TieredMergePolicy merges out-of-order segments.
>
> I don't understand why you need to encouraging merging of the more
> recent (by your "time" field) segments...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
> <ra...@gmail.com> wrote:
> > Mike,
> >
> > Each of my flushed segment is fully ordered by time. But
> TieredMergePolicy
> > or LogByteSizeMergePolicy is going to pick arbitrary time-segments and
> > disturb this arrangement and I wanted some kind of control on this.
> >
> > But like you pointed-out, going by only be time-adjacent merges can be
> > disastrous.
> >
> > Is there a way to mix both time and size to arrive at a somewhat
> > [less-than-accurate] global order of segment merges.
> >
> > Like attempt a time-adjacent merge, provided size of segments is not
> > extremely skewed etc...
> >
> > --
> > Ravi
> >
> >
> >
> >
> >
> >
> >
> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
> > lucene@mikemccandless.com> wrote:
> >
> >> You want to focus merging on the segments containing newer documents?
> >> Why?  This seems somewhat dangerous...
> >>
> >> Not taking into account the "true" segment size can lead to very very
> >> poor merge decisions ... you should turn on IndexWriter's infoStream
> >> and do a long running test to convince yourself the merging is being
> >> sane.
> >>
> >> Mike
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
> >> <ra...@gmail.com> wrote:
> >> > Thanks Mike,
> >> >
> >> > Will try your suggestion. I will try to describe the actual use-case
> >> itself
> >> >
> >> > There is a requirement for merging time-adjacent segments
> [append-only,
> >> > rolling time-series data]
> >> >
> >> > All Documents have a timestamp affixed and during flush I need to note
> >> down
> >> > the least timestamp for all documents, through Codec.
> >> >
> >> > Then, I define a TimeMergePolicy extends LogMergePolicy and define the
> >> > segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].
> >> >
> >> > LogMergePolicy will auto-arrange levels of segments according time and
> >> > proceed with merges. Latest segments will be lesser in size and
> preferred
> >> > during merges than older and bigger segments
> >> >
> >> > Do you think such an approach will be fine or there are better ways to
> >> > solve this?
> >> >
> >> > --
> >> > Ravi
> >> >
> >> >
> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
> >> > lucene@mikemccandless.com> wrote:
> >> >
> >> >> Somewhere in those numeric trie terms are the exact integers from
> your
> >> >> documents, encoded.
> >> >>
> >> >> You can use oal.util.NumericUtils.prefixCodecToInt to get the int
> >> >> value back from the BytesRef term.
> >> >>
> >> >> But you need to filter out the "higher level" terms, e.g. using
> >> >> NumericUtils.getPrefixCodedLongShift(term) == 0.  Or use
> >> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum.  I believe
> >> >> all the terms you want come first, so once you hit a term where
> >> >> .getPrefixCodedLongShift is > 0, that's your max term and you can
> stop
> >> >> checking.
> >> >>
> >> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so that
> >> >> you can e.g. pull your own TermsEnum and iterate the terms yourself.
> >> >>
> >> >> Mike McCandless
> >> >>
> >> >> http://blog.mikemccandless.com
> >> >>
> >> >>
> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
> >> >> <ra...@gmail.com> wrote:
> >> >> > I use a Codec to flush data. All methods delegate to actual
> >> >> Lucene42Codec,
> >> >> > except for intercepting one single-field. This field is indexed as
> an
> >> >> > IntField [Numeric-Trie...], with precisionStep=4.
> >> >> >
> >> >> > The purpose of the Codec is as follows
> >> >> >
> >> >> > 1. Note the first BytesRef for this field
> >> >> > 2. During finish() call [TermsConsumer.java], note the last
> BytesRef
> >> for
> >> >> > this field
> >> >> > 3. Converts both the first/last BytesRef to respective integers
> >> >> > 4. Store these 2 ints in segment-info diagnostics
> >> >> >
> >> >> > The problem with this approach is that, first/last BytesRef is
> totally
> >> >> > different from the actual "int" values I try to index. I guess,
> this
> >> is
> >> >> > because Numeric-Trie explodes all the integers into it's own
> format of
> >> >> > BytesRefs. Hence my Codec stores the wrong values in
> >> segment-diagnostics
> >> >> >
> >> >> > Is there a way I can record actual min/max int-values correctly in
> my
> >> >> codec
> >> >> > and still support NumericRange search?
> >> >> >
> >> >> > --
> >> >> > Ravi
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >>
> >> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Actual min and max-value of NumericField during codec flush

Posted by Michael McCandless <lu...@mikemccandless.com>.
LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total order.

Only TieredMergePolicy merges out-of-order segments.

I don't understand why you need to encourage merging of the more
recent (by your "time" field) segments...

Mike McCandless

http://blog.mikemccandless.com


On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
<ra...@gmail.com> wrote:
> Mike,
>
> Each of my flushed segment is fully ordered by time. But TieredMergePolicy
> or LogByteSizeMergePolicy is going to pick arbitrary time-segments and
> disturb this arrangement and I wanted some kind of control on this.
>
> But like you pointed-out, going by only be time-adjacent merges can be
> disastrous.
>
> Is there a way to mix both time and size to arrive at a somewhat
> [less-than-accurate] global order of segment merges.
>
> Like attempt a time-adjacent merge, provided size of segments is not
> extremely skewed etc...
>
> --
> Ravi
>
>
>
>
>
>
>
> On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> You want to focus merging on the segments containing newer documents?
>> Why?  This seems somewhat dangerous...
>>
>> Not taking into account the "true" segment size can lead to very very
>> poor merge decisions ... you should turn on IndexWriter's infoStream
>> and do a long running test to convince yourself the merging is being
>> sane.
>>
>> Mike
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
>> <ra...@gmail.com> wrote:
>> > Thanks Mike,
>> >
>> > Will try your suggestion. I will try to describe the actual use-case
>> itself
>> >
>> > There is a requirement for merging time-adjacent segments [append-only,
>> > rolling time-series data]
>> >
>> > All Documents have a timestamp affixed and during flush I need to note
>> down
>> > the least timestamp for all documents, through Codec.
>> >
>> > Then, I define a TimeMergePolicy extends LogMergePolicy and define the
>> > segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].
>> >
>> > LogMergePolicy will auto-arrange levels of segments according time and
>> > proceed with merges. Latest segments will be lesser in size and preferred
>> > during merges than older and bigger segments
>> >
>> > Do you think such an approach will be fine or there are better ways to
>> > solve this?
>> >
>> > --
>> > Ravi
>> >
>> >
>> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
>> > lucene@mikemccandless.com> wrote:
>> >
>> >> Somewhere in those numeric trie terms are the exact integers from your
>> >> documents, encoded.
>> >>
>> >> You can use oal.util.NumericUtils.prefixCodecToInt to get the int
>> >> value back from the BytesRef term.
>> >>
>> >> But you need to filter out the "higher level" terms, e.g. using
>> >> NumericUtils.getPrefixCodedLongShift(term) == 0.  Or use
>> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum.  I believe
>> >> all the terms you want come first, so once you hit a term where
>> >> .getPrefixCodedLongShift is > 0, that's your max term and you can stop
>> >> checking.
>> >>
>> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so that
>> >> you can e.g. pull your own TermsEnum and iterate the terms yourself.
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >>
>> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
>> >> <ra...@gmail.com> wrote:
>> >> > I use a Codec to flush data. All methods delegate to actual
>> >> Lucene42Codec,
>> >> > except for intercepting one single-field. This field is indexed as an
>> >> > IntField [Numeric-Trie...], with precisionStep=4.
>> >> >
>> >> > The purpose of the Codec is as follows
>> >> >
>> >> > 1. Note the first BytesRef for this field
>> >> > 2. During finish() call [TermsConsumer.java], note the last BytesRef
>> for
>> >> > this field
>> >> > 3. Converts both the first/last BytesRef to respective integers
>> >> > 4. Store these 2 ints in segment-info diagnostics
>> >> >
>> >> > The problem with this approach is that, first/last BytesRef is totally
>> >> > different from the actual "int" values I try to index. I guess, this
>> is
>> >> > because Numeric-Trie explodes all the integers into it's own format of
>> >> > BytesRefs. Hence my Codec stores the wrong values in
>> segment-diagnostics
>> >> >
>> >> > Is there a way I can record actual min/max int-values correctly in my
>> >> codec
>> >> > and still support NumericRange search?
>> >> >
>> >> > --
>> >> > Ravi
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Actual min and max-value of NumericField during codec flush

Posted by Ravikumar Govindarajan <ra...@gmail.com>.
Mike,

Each of my flushed segments is fully ordered by time. But TieredMergePolicy
or LogByteSizeMergePolicy is going to pick arbitrary time-segments and
disturb this arrangement, and I wanted some kind of control over this.

But like you pointed out, going by time-adjacent merges alone can be
disastrous.

Is there a way to mix both time and size to arrive at a somewhat
[less-than-accurate] global order of segment merges?

Like attempting a time-adjacent merge, provided the sizes of the segments
are not extremely skewed, etc...

--
Ravi







On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> You want to focus merging on the segments containing newer documents?
> Why?  This seems somewhat dangerous...
>
> Not taking into account the "true" segment size can lead to very very
> poor merge decisions ... you should turn on IndexWriter's infoStream
> and do a long running test to convince yourself the merging is being
> sane.
>
> Mike
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
> <ra...@gmail.com> wrote:
> > Thanks Mike,
> >
> > Will try your suggestion. I will try to describe the actual use-case
> itself
> >
> > There is a requirement for merging time-adjacent segments [append-only,
> > rolling time-series data]
> >
> > All Documents have a timestamp affixed and during flush I need to note
> down
> > the least timestamp for all documents, through Codec.
> >
> > Then, I define a TimeMergePolicy extends LogMergePolicy and define the
> > segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].
> >
> > LogMergePolicy will auto-arrange levels of segments according time and
> > proceed with merges. Latest segments will be lesser in size and preferred
> > during merges than older and bigger segments
> >
> > Do you think such an approach will be fine or there are better ways to
> > solve this?
> >
> > --
> > Ravi
> >
> >
> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
> > lucene@mikemccandless.com> wrote:
> >
> >> Somewhere in those numeric trie terms are the exact integers from your
> >> documents, encoded.
> >>
> >> You can use oal.util.NumericUtils.prefixCodecToInt to get the int
> >> value back from the BytesRef term.
> >>
> >> But you need to filter out the "higher level" terms, e.g. using
> >> NumericUtils.getPrefixCodedLongShift(term) == 0.  Or use
> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum.  I believe
> >> all the terms you want come first, so once you hit a term where
> >> .getPrefixCodedLongShift is > 0, that's your max term and you can stop
> >> checking.
> >>
> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so that
> >> you can e.g. pull your own TermsEnum and iterate the terms yourself.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
> >> <ra...@gmail.com> wrote:
> >> > I use a Codec to flush data. All methods delegate to actual
> >> Lucene42Codec,
> >> > except for intercepting one single-field. This field is indexed as an
> >> > IntField [Numeric-Trie...], with precisionStep=4.
> >> >
> >> > The purpose of the Codec is as follows
> >> >
> >> > 1. Note the first BytesRef for this field
> >> > 2. During finish() call [TermsConsumer.java], note the last BytesRef
> for
> >> > this field
> >> > 3. Converts both the first/last BytesRef to respective integers
> >> > 4. Store these 2 ints in segment-info diagnostics
> >> >
> >> > The problem with this approach is that, first/last BytesRef is totally
> >> > different from the actual "int" values I try to index. I guess, this
> is
> >> > because Numeric-Trie explodes all the integers into it's own format of
> >> > BytesRefs. Hence my Codec stores the wrong values in
> segment-diagnostics
> >> >
> >> > Is there a way I can record actual min/max int-values correctly in my
> >> codec
> >> > and still support NumericRange search?
> >> >
> >> > --
> >> > Ravi
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Actual min and max-value of NumericField during codec flush

Posted by Michael McCandless <lu...@mikemccandless.com>.
You want to focus merging on the segments containing newer documents?
Why?  This seems somewhat dangerous...

Not taking into account the "true" segment size can lead to very very
poor merge decisions ... you should turn on IndexWriter's infoStream
and do a long running test to convince yourself the merging is being
sane.
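 
Something like this is enough to see every flush/merge decision (the analyzer
and version here are just placeholders for whatever you already use):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.util.Version;

    // Route IndexWriter's flush/merge logging to stdout for inspection.
    IndexWriterConfig iwc =
        new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
    iwc.setInfoStream(System.out);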

Mike

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
<ra...@gmail.com> wrote:
> Thanks Mike,
>
> Will try your suggestion. I will try to describe the actual use-case itself
>
> There is a requirement for merging time-adjacent segments [append-only,
> rolling time-series data]
>
> All Documents have a timestamp affixed and during flush I need to note down
> the least timestamp for all documents, through Codec.
>
> Then, I define a TimeMergePolicy extends LogMergePolicy and define the
> segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].
>
> LogMergePolicy will auto-arrange levels of segments according time and
> proceed with merges. Latest segments will be lesser in size and preferred
> during merges than older and bigger segments
>
> Do you think such an approach will be fine or there are better ways to
> solve this?
>
> --
> Ravi
>
>
> On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> Somewhere in those numeric trie terms are the exact integers from your
>> documents, encoded.
>>
>> You can use oal.util.NumericUtils.prefixCodecToInt to get the int
>> value back from the BytesRef term.
>>
>> But you need to filter out the "higher level" terms, e.g. using
>> NumericUtils.getPrefixCodedLongShift(term) == 0.  Or use
>> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum.  I believe
>> all the terms you want come first, so once you hit a term where
>> .getPrefixCodedLongShift is > 0, that's your max term and you can stop
>> checking.
>>
>> BTW, in 5.0, the codec API for PostingsFormat has improved, so that
>> you can e.g. pull your own TermsEnum and iterate the terms yourself.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
>> <ra...@gmail.com> wrote:
>> > I use a Codec to flush data. All methods delegate to actual
>> Lucene42Codec,
>> > except for intercepting one single-field. This field is indexed as an
>> > IntField [Numeric-Trie...], with precisionStep=4.
>> >
>> > The purpose of the Codec is as follows
>> >
>> > 1. Note the first BytesRef for this field
>> > 2. During finish() call [TermsConsumer.java], note the last BytesRef for
>> > this field
>> > 3. Converts both the first/last BytesRef to respective integers
>> > 4. Store these 2 ints in segment-info diagnostics
>> >
>> > The problem with this approach is that, first/last BytesRef is totally
>> > different from the actual "int" values I try to index. I guess, this is
>> > because Numeric-Trie explodes all the integers into it's own format of
>> > BytesRefs. Hence my Codec stores the wrong values in segment-diagnostics
>> >
>> > Is there a way I can record actual min/max int-values correctly in my
>> codec
>> > and still support NumericRange search?
>> >
>> > --
>> > Ravi
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Actual min and max-value of NumericField during codec flush

Posted by Ravikumar Govindarajan <ra...@gmail.com>.
Thanks Mike,

Will try your suggestion. I will try to describe the actual use-case itself.

There is a requirement for merging time-adjacent segments [append-only,
rolling time-series data].

All documents have a timestamp affixed, and during flush I need to note down
the least timestamp across all documents, through the Codec.

Then, I define a TimeMergePolicy that extends LogMergePolicy and define the
segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].

LogMergePolicy will auto-arrange levels of segments according to time and
proceed with merges. The latest segments will be smaller in size and preferred
during merges over older and bigger segments.

Do you think such an approach will be fine, or are there better ways to
solve this?
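 
Something like the following is what I have in mind (only a sketch; the
"leastTime" diagnostics key is whatever my codec writes at flush, and the
exact LogMergePolicy hooks may differ slightly by version):

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.index.LogMergePolicy;
    import org.apache.lucene.index.SegmentCommitInfo;

    // Ranks segments by the least timestamp stored in their diagnostics, so
    // newer segments look "smaller" and land in the levels LogMergePolicy
    // merges most eagerly. "leastTime" is the key the codec writes at flush.
    public class TimeMergePolicy extends LogMergePolicy {

      public TimeMergePolicy() {
        minMergeSize = 1;                 // every segment is eligible
        maxMergeSize = Long.MAX_VALUE;
        maxMergeSizeForForcedMerge = Long.MAX_VALUE;
      }

      @Override
      protected long size(SegmentCommitInfo info) throws IOException {
        Map<String, String> diag = info.info.getDiagnostics();
        String leastTime = diag.get("leastTime");
        if (leastTime == null) {
          return info.info.getDocCount(); // segment without a recorded timestamp
        }
        return Long.MAX_VALUE - Long.parseLong(leastTime);
      }
    }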

--
Ravi


On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Somewhere in those numeric trie terms are the exact integers from your
> documents, encoded.
>
> You can use oal.util.NumericUtils.prefixCodecToInt to get the int
> value back from the BytesRef term.
>
> But you need to filter out the "higher level" terms, e.g. using
> NumericUtils.getPrefixCodedLongShift(term) == 0.  Or use
> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum.  I believe
> all the terms you want come first, so once you hit a term where
> .getPrefixCodedLongShift is > 0, that's your max term and you can stop
> checking.
>
> BTW, in 5.0, the codec API for PostingsFormat has improved, so that
> you can e.g. pull your own TermsEnum and iterate the terms yourself.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
> <ra...@gmail.com> wrote:
> > I use a Codec to flush data. All methods delegate to actual
> Lucene42Codec,
> > except for intercepting one single-field. This field is indexed as an
> > IntField [Numeric-Trie...], with precisionStep=4.
> >
> > The purpose of the Codec is as follows
> >
> > 1. Note the first BytesRef for this field
> > 2. During finish() call [TermsConsumer.java], note the last BytesRef for
> > this field
> > 3. Converts both the first/last BytesRef to respective integers
> > 4. Store these 2 ints in segment-info diagnostics
> >
> > The problem with this approach is that, first/last BytesRef is totally
> > different from the actual "int" values I try to index. I guess, this is
> > because Numeric-Trie explodes all the integers into it's own format of
> > BytesRefs. Hence my Codec stores the wrong values in segment-diagnostics
> >
> > Is there a way I can record actual min/max int-values correctly in my
> codec
> > and still support NumericRange search?
> >
> > --
> > Ravi
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Actual min and max-value of NumericField during codec flush

Posted by Michael McCandless <lu...@mikemccandless.com>.
Somewhere in those numeric trie terms are the exact integers from your
documents, encoded.

You can use oal.util.NumericUtils.prefixCodedToInt to get the int
value back from the BytesRef term.

But you need to filter out the "higher level" terms, e.g. using
NumericUtils.getPrefixCodedIntShift(term) == 0.  Or use
NumericUtils.filterPrefixCodedInts to wrap a TermsEnum.  I believe
all the terms you want come first, so once you hit a term where
.getPrefixCodedIntShift is > 0, that's your max term and you can stop
checking.
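 
E.g., a tiny sketch of the bookkeeping inside your codec (for an IntField;
the class and method names are illustrative):

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.NumericUtils;

    // Feed every term of the intercepted field to observe(), in term order.
    // Only full-precision terms (shift == 0) are decoded; higher-level trie
    // terms are ignored. Since full-precision terms sort by value, the first
    // one seen is the min and the last one seen is the max.
    final class IntMinMaxTracker {
      Integer min;
      Integer max;

      void observe(BytesRef term) {
        if (NumericUtils.getPrefixCodedIntShift(term) != 0) {
          return; // trie "parent" term, skip
        }
        int value = NumericUtils.prefixCodedToInt(term);
        if (min == null) {
          min = value;
        }
        max = value;
      }
    }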

BTW, in 5.0, the codec API for PostingsFormat has improved, so that
you can e.g. pull your own TermsEnum and iterate the terms yourself.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
<ra...@gmail.com> wrote:
> I use a Codec to flush data. All methods delegate to actual Lucene42Codec,
> except for intercepting one single-field. This field is indexed as an
> IntField [Numeric-Trie...], with precisionStep=4.
>
> The purpose of the Codec is as follows
>
> 1. Note the first BytesRef for this field
> 2. During finish() call [TermsConsumer.java], note the last BytesRef for
> this field
> 3. Converts both the first/last BytesRef to respective integers
> 4. Store these 2 ints in segment-info diagnostics
>
> The problem with this approach is that, first/last BytesRef is totally
> different from the actual "int" values I try to index. I guess, this is
> because Numeric-Trie explodes all the integers into it's own format of
> BytesRefs. Hence my Codec stores the wrong values in segment-diagnostics
>
> Is there a way I can record actual min/max int-values correctly in my codec
> and still support NumericRange search?
>
> --
> Ravi

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org