Posted to solr-user@lucene.apache.org by David Hastings <ha...@gmail.com> on 2017/08/03 15:07:16 UTC

mixed index with commongrams

Hey all, I have yet to run an experiment to test this, but was wondering if
anyone knows the answer ahead of time.
If I have an index built with documents from before implementing the
CommonGrams filter, then enable it and start adding documents that have the
filter/tokenizer applied, will searches that fit the criteria, for example:
"to be or not to be"
still return results from the earlier documents as well as the new ones?
The idea is that a full re-index is going to be difficult, so I would rather
do it over time by replacing large numbers of documents incrementally.
Thanks,
Dave
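
For context, "implementing the CommonGrams filter" in Solr normally means
adding solr.CommonGramsFilterFactory to the index-time analyzer and
solr.CommonGramsQueryFilterFactory to the query-time analyzer in schema.xml.
A minimal sketch follows; the field type name and the commonwords.txt file
are placeholders, not taken from this thread:

<fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emits both single terms and word_word bigrams for listed common words -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- at query time only the bigrams are kept where they can be formed -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

Documents indexed before such a change contain only the plain unigrams,
which is the crux of the question above.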

Re: mixed index with commongrams

Posted by David Hastings <ha...@gmail.com>.
Haven't really looked much into that; here is a snippet from today's GC log,
if you wouldn't mind shedding any light on it:

2017-08-03T11:46:16.265-0400: 3200938.383: [GC (Allocation Failure)
2017-08-03T11:46:16.265-0400: 3200938.383: [ParNew
Desired survivor size 1966060336 bytes, new threshold 8 (max 8)
- age   1:  128529184 bytes,  128529184 total
- age   2:   43075632 bytes,  171604816 total
- age   3:   64402592 bytes,  236007408 total
- age   4:   35621704 bytes,  271629112 total
- age   5:   44285584 bytes,  315914696 total
- age   6:   45372512 bytes,  361287208 total
- age   7:   41975368 bytes,  403262576 total
- age   8:   72959688 bytes,  476222264 total
: 9133992K->577219K(10666688K), 0.2730329 secs]
23200886K->14693007K(49066688K), 0.2732690 secs] [Times: user=2.01
sys=0.01, real=0.28 secs]
Heap after GC invocations=12835 (full 109):
 par new generation   total 10666688K, used 577219K [0x00007f8023000000,
0x00007f8330400000, 0x00007f8330400000)
  eden space 8533376K,   0% used [0x00007f8023000000, 0x00007f8023000000,
0x00007f822bd60000)
  from space 2133312K,  27% used [0x00007f82ae0b0000, 0x00007f82d1460d98,
0x00007f8330400000)
  to   space 2133312K,   0% used [0x00007f822bd60000, 0x00007f822bd60000,
0x00007f82ae0b0000)
 concurrent mark-sweep generation total 38400000K, used 14115788K
[0x00007f8330400000, 0x00007f8c58000000, 0x00007f8c58000000)
 Metaspace       used 36698K, capacity 37169K, committed 37512K, reserved
38912K
}





On Thu, Aug 3, 2017 at 11:58 AM, Walter Underwood <wu...@wunderwood.org>
wrote:

> How long are your GC pauses? Those affect all queries, so they make the
> 99th percentile slow with queries that should be fast.
>
> The G1 collector has helped our 99th percentile.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Aug 3, 2017, at 8:48 AM, David Hastings <ha...@gmail.com>
> wrote:
> >
> > Thanks, thats what i kind of expected.  still debating whether the space
> > increase is worth it, right now Im at .7% of searches taking longer than
> 10
> > seconds, and 6% taking longer than 1, so when i see things like this in
> the
> > morning it bugs me a bit:
> >
> > 2017-08-02 11:50:48 : 58979/1000 secs : ("Rules of Practice for the
> Courts
> > of Equity of the United States")
> > 2017-08-02 02:16:36 : 54749/1000 secs : ("The American Cause")
> > 2017-08-02 19:27:58 : 54561/1000 secs : ("register of the department of
> > justice")
> >
> > which could all be annihilated with CG's, at the expense, according to
> HT,
> > of a 40% increase in index size.
> >
> >
> >
> > On Thu, Aug 3, 2017 at 11:21 AM, Erick Erickson <erickerickson@gmail.com
> >
> > wrote:
> >
> >> bq: will that search still return results form the earlier documents
> >> as well as the new ones
> >>
> >> In a word, "no". By definition the analysis chain applied at index
> >> time puts tokens in the index and that's all you have to search
> >> against for the doc unless and until you re-index the document.
> >>
> >> You really have two choices here:
> >> 1> live with the differing results until you get done re-indexing
> >> 2> index to an offline collection and then use, say, collection
> >> aliasing to make the switch atomically.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Aug 3, 2017 at 8:07 AM, David Hastings
> >> <ha...@gmail.com> wrote:
> >>> Hey all, I have yet to run an experiment to test this but was wondering
> >> if
> >>> anyone knows the answer ahead of time.
> >>> If i have an index built with documents before implementing the
> >> commongrams
> >>> filter, then enable it, and start adding documents that have the
> >>> filter/tokenizer applied, will searches that fit the criteria, for
> >> example:
> >>> "to be or not to be"
> >>> will that search still return results form the earlier documents as
> well
> >> as
> >>> the new ones?  The idea is that a full re-index is going to be
> difficult,
> >>> so would rather do it over time by replacing large numbers of documents
> >>> incrementally.  Thanks,
> >>> Dave
> >>
>
>

Re: mixed index with commongrams

Posted by Walter Underwood <wu...@wunderwood.org>.
How long are your GC pauses? Those affect all queries, so they make the 99th percentile slow even for queries that should be fast.

The G1 collector has helped our 99th percentile.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)
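
For anyone wanting to try the G1 suggestion, Solr reads its GC flags from
the GC_TUNE variable in solr.in.sh, so a minimal sketch of the switch looks
like the following; the pause target is only an illustrative starting point,
not a value recommended anywhere in this thread:

GC_TUNE="-XX:+UseG1GC \
  -XX:+ParallelRefProcEnabled \
  -XX:MaxGCPauseMillis=250 \
  -XX:+PerfDisableSharedMem"

With the CMS setup shown in the log earlier in the thread, the number to
compare against is the real= time reported for each collection.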


> On Aug 3, 2017, at 8:48 AM, David Hastings <ha...@gmail.com> wrote:
> 
> Thanks, thats what i kind of expected.  still debating whether the space
> increase is worth it, right now Im at .7% of searches taking longer than 10
> seconds, and 6% taking longer than 1, so when i see things like this in the
> morning it bugs me a bit:
> 
> 2017-08-02 11:50:48 : 58979/1000 secs : ("Rules of Practice for the Courts
> of Equity of the United States")
> 2017-08-02 02:16:36 : 54749/1000 secs : ("The American Cause")
> 2017-08-02 19:27:58 : 54561/1000 secs : ("register of the department of
> justice")
> 
> which could all be annihilated with CG's, at the expense, according to HT,
> of a 40% increase in index size.
> 
> 
> 
> On Thu, Aug 3, 2017 at 11:21 AM, Erick Erickson <er...@gmail.com>
> wrote:
> 
>> bq: will that search still return results form the earlier documents
>> as well as the new ones
>> 
>> In a word, "no". By definition the analysis chain applied at index
>> time puts tokens in the index and that's all you have to search
>> against for the doc unless and until you re-index the document.
>> 
>> You really have two choices here:
>> 1> live with the differing results until you get done re-indexing
>> 2> index to an offline collection and then use, say, collection
>> aliasing to make the switch atomically.
>> 
>> Best,
>> Erick
>> 
>> On Thu, Aug 3, 2017 at 8:07 AM, David Hastings
>> <ha...@gmail.com> wrote:
>>> Hey all, I have yet to run an experiment to test this but was wondering
>> if
>>> anyone knows the answer ahead of time.
>>> If i have an index built with documents before implementing the
>> commongrams
>>> filter, then enable it, and start adding documents that have the
>>> filter/tokenizer applied, will searches that fit the criteria, for
>> example:
>>> "to be or not to be"
>>> will that search still return results form the earlier documents as well
>> as
>>> the new ones?  The idea is that a full re-index is going to be difficult,
>>> so would rather do it over time by replacing large numbers of documents
>>> incrementally.  Thanks,
>>> Dave
>> 


Re: mixed index with commongrams

Posted by David Hastings <ha...@gmail.com>.
Thanks, that's what I kind of expected.  Still debating whether the space
increase is worth it; right now I'm at 0.7% of searches taking longer than 10
seconds, and 6% taking longer than 1, so when I see things like this in the
morning it bugs me a bit:

2017-08-02 11:50:48 : 58979/1000 secs : ("Rules of Practice for the Courts
of Equity of the United States")
2017-08-02 02:16:36 : 54749/1000 secs : ("The American Cause")
2017-08-02 19:27:58 : 54561/1000 secs : ("register of the department of
justice")

which could all be annihilated with CommonGrams, at the expense, according
to HT, of a 40% increase in index size.
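
As a rough illustration of why CommonGrams helps exactly these queries:
assuming "of" and "the" are in the common-words list, a query-time chain
with CommonGramsQueryFilterFactory should turn a phrase such as
"register of the department of justice" into bigram terms along the lines of

  register_of of_the the_department department_of of_justice

so the phrase query walks a handful of comparatively rare bigram postings
instead of the huge postings lists for "of" and "the". (The token output
here is an approximation, not captured from the poster's actual schema.)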



On Thu, Aug 3, 2017 at 11:21 AM, Erick Erickson <er...@gmail.com>
wrote:

> bq: will that search still return results form the earlier documents
> as well as the new ones
>
> In a word, "no". By definition the analysis chain applied at index
> time puts tokens in the index and that's all you have to search
> against for the doc unless and until you re-index the document.
>
> You really have two choices here:
> 1> live with the differing results until you get done re-indexing
> 2> index to an offline collection and then use, say, collection
> aliasing to make the switch atomically.
>
> Best,
> Erick
>
> On Thu, Aug 3, 2017 at 8:07 AM, David Hastings
> <ha...@gmail.com> wrote:
> > Hey all, I have yet to run an experiment to test this but was wondering
> if
> > anyone knows the answer ahead of time.
> > If i have an index built with documents before implementing the
> commongrams
> > filter, then enable it, and start adding documents that have the
> > filter/tokenizer applied, will searches that fit the criteria, for
> example:
> > "to be or not to be"
> > will that search still return results form the earlier documents as well
> as
> > the new ones?  The idea is that a full re-index is going to be difficult,
> > so would rather do it over time by replacing large numbers of documents
> > incrementally.  Thanks,
> > Dave
>

Re: mixed index with commongrams

Posted by Erick Erickson <er...@gmail.com>.
bq: will that search still return results from the earlier documents
as well as the new ones

In a word, "no". By definition the analysis chain applied at index
time puts tokens in the index and that's all you have to search
against for the doc unless and until you re-index the document.

You really have two choices here:
1> live with the differing results until you get done re-indexing
2> index to an offline collection and then use, say, collection
aliasing to make the switch atomically.

Best,
Erick
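
For the record, the alias switch in option 2> is a single Collections API
call, roughly like this (collection and alias names are placeholders):

http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=mycollection&collections=mycollection_v2

Queries sent to the alias "mycollection" then go to the freshly rebuilt
"mycollection_v2" in one atomic step, and the old collection can be deleted
once it is no longer needed.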

On Thu, Aug 3, 2017 at 8:07 AM, David Hastings
<ha...@gmail.com> wrote:
> Hey all, I have yet to run an experiment to test this but was wondering if
> anyone knows the answer ahead of time.
> If i have an index built with documents before implementing the commongrams
> filter, then enable it, and start adding documents that have the
> filter/tokenizer applied, will searches that fit the criteria, for example:
> "to be or not to be"
> will that search still return results form the earlier documents as well as
> the new ones?  The idea is that a full re-index is going to be difficult,
> so would rather do it over time by replacing large numbers of documents
> incrementally.  Thanks,
> Dave