Posted to java-user@lucene.apache.org by Tom Burton-West <tb...@umich.edu> on 2013/08/02 00:40:10 UTC

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

Hi all,

OK, I really should have titled the post, "CheckIndex limit with large tvd
files?"

I started a new CheckIndex run about 1:00 pm on Tuesday and it seems to be
stuck again looking at term vectors.
I gave CheckIndex 32GB of memory, turned on GC logging, and echoed STDERR
and STDOUT to a file.
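
For reference, the run was kicked off along these lines (a sketch; the jar
name and log file names here are illustrative, not the exact command):

java -Xmx32g -Xms32g -verbose:gc -Xloggc:gc.log \
     -cp lucene-core-4.3.1.jar org.apache.lucene.index.CheckIndex \
     /htsolr/lss-dev/solrs/4.2/3/core/data/index 2>&1 | tee checkindex.out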

It seems stuck while testing term vectors, but maybe it just takes
several days to test a term vector file that is 343GB.

Yes, I know I said we had term vectors turned off.  I forgot that we were
using a slightly modified version of the schema we use when we index
individual books at the page level.  We are using the fast-vector
highlighter, so we have term vectors turned on:

 <fieldType name="FullText" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="false" stored="true"
termVectors="true" termPositions="true" termOffsets="true"
omitNorms="false">
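
For comparison, the variant we thought we were running would just drop the
vector attributes, roughly like this (a sketch, not our exact schema):

 <fieldType name="FullText" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="false" stored="true"
omitNorms="false">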

I've appended a listing of the top memory users from pmap below.

Looks like the *tvd file is using about 300GB of virtual memory, followed
by the *doc, *fdt, and *pos files.

Since we have never run CheckIndex on large indexes with term vectors
before, we have no idea how long we should expect it to take.

Our normal page-level book indexes generally hold about 1,000 books (about
300,000 documents/pages) and are 10-15GB total, with the *tvf files
totaling about 700 MB and the *tvd files totaling a few hundred K.


Tom

----
The top 10 mappings from pmap are:

 total        804,745,732K
00002baaf526c000 300,897,888K r--s-
 /htsolr/lss-dev/solrs/4.2/3/core/data/index/_bch.tvd
00002b3b4bf1b000 155,250,472K r--s-
 /htsolr/lss-dev/solrs/4.2/3/core/data/index/_bch_Lucene41_0.doc
00002b88aa709000 143,788,268K r--s-
 /htsolr/lss-dev/solrs/4.2/3/core/data/index/_bch.fdt
00002b604fae5000 139,820,064K r--s-
 /htsolr/lss-dev/solrs/4.2/3/core/data/index/_bch_Lucene41_0.pos
00002b32e6c10000 33,554,476K rw---    [ anon ]
00002b81a59ed000 29,196,076K r--s-
 /htsolr/lss-dev/solrs/4.2/3/core/data/index/_bch_Lucene41_0.tim
00002b3aee31b000 1,315,184K rw---    [ anon ]
00002b889b9b8000 243,012K r--s-
 /htsolr/lss-dev/solrs/4.2/3/core/data/index/_bch.nvd
00002b3ae6c39000 109,276K rw---    [ anon ]
00002bf2b2804000  99,272K r--s-
 /htsolr/lss-dev/solrs/4.2/3/core/data/index/_bch.tvx




>
> On Tue, Jul 30, 2013 at 1:06 PM, Tom Burton-West <tb...@umich.edu>
> wrote:
> > Thanks Mike, Robert and Adrien,
> >
> > Unfortunately, I killed the processes, so it's too late to get a stack
> > trace.  One thing that was suspicious was that top was reporting memory
> > use as 20GB res even though I invoked the JVM with java -Xmx10g -Xms10g.
> >
> > I'm going to double the memory, turn on GC logging, and remember to echo
> > STDERR to a log and run it again on one of the indexes.
> > I'll report back as soon as something interesting shows up.  (Probably
> > tomorrow sometime.)
> >
> > Tom
> >
> >
> > On Tue, Jul 30, 2013 at 11:22 AM, Michael McCandless <
> > lucene@mikemccandless.com> wrote:
> >
> >> Can you get a stack trace so we can see where the thread is stuck?
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

Posted by Robert Muir <rc...@gmail.com>.
Thanks for testing/confirmation!

On Fri, Aug 9, 2013 at 11:15 AM, Tom Burton-West <tb...@umich.edu> wrote:
> Hi Robert,
>
> Thanks for the fix.  CheckIndex finished within 24 hours, which is not
> terrible given the size of this index (about a terabyte).
>
> Tom
>
> Opening index @ /htsolr/lss-dev/solrs/4.2/3/core/data/index
>
> Segments file=segments_e numSegments=2 version=4.2.1 format=
> userData={commitTimeMSec=1374712392103}
>   1 of 2: name=_bch docCount=82946896
>     codec=Lucene42
>     compound=false
>     numFiles=12
>     size (MB)=752,005.689
>     diagnostics = {timestamp=1374657630506, os=Linux,
> os.version=2.6.18-348.12.1.el5, mergeFactor=16, source=merge,
> lucene.version=4.2.1 1461071 - mark - 2013-03-26 08:23:34, os.arch=amd64,
> mergeMaxNumSegments=2, java.version=1.6.0_16, java.vendor=Sun Microsystems
> Inc.}
>     no deletions
>     test: open reader.........OK
>     test: fields..............OK [12 fields]
>     test: field norms.........OK [3 fields]
>     test: terms, freq, prox...OK [2442919802 terms; 73922320413 terms/docs
> pairs; 109976572432 tokens]
>     test: stored fields.......OK [960417844 total field count; avg 11.579
> fields per doc]
>     test: term vectors........OK [81452262 total vector count; avg 1
> term/freq vector fields per doc]
>     test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC;
> 0 SORTED; 0 SORTED_SET]
>
>   2 of 2: name=_bcg docCount=42021835
>     codec=Lucene42
>     compound=false
>     numFiles=12
>     size (MB)=371,991.272
>     diagnostics = {timestamp=1374612106174, os=Linux,
> os.version=2.6.18-348.12.1.el5, mergeFactor=30, source=merge,
> lucene.version=4.2.1 1461071 - mark - 2013-03-26 08:23:34, os.arch=amd64,
> mergeMaxNumSegments=2, java.version=1.6.0_16, java.vendor=Sun Microsystems
> Inc.}
>     no deletions
>     test: open reader.........OK
>     test: fields..............OK [12 fields]
>     test: field norms.........OK [3 fields]
>     test: terms, freq, prox...OK [1435132736 terms; 36134595066 terms/docs
> pairs; 53691487260 tokens]
>     test: stored fields.......OK [483146935 total field count; avg 11.498
> fields per doc]
>     test: term vectors........OK [41299979 total vector count; avg 1
> term/freq vector fields per doc]
>     test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC;
> 0 SORTED; 0 SORTED_SET]
>
> No problems were detected with this index.
>
>
>
>
> On Thu, Aug 8, 2013 at 11:24 AM, Robert Muir <rc...@gmail.com> wrote:
>
>> On Thu, Aug 8, 2013 at 11:18 AM, Tom Burton-West <tb...@umich.edu>
>> wrote:
>> > Sure, I should be able to build a lucene-core jar and give it a try.  I
>> > probably won't run it until tomorrow night though, because right now
>> > I'm running some other tests on the machine I would run CheckIndex
>> > from, and disk I/O (i.e. CheckIndex) would mess with the tests.
>> >
>> > Do I just need to check out revision 1511014 from branch_4x and build it?
>> >
>> >
>>
>> yes, something like:
>>
>> svn co -r 1511014
>> https://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x
>> cd branch_4x/lucene
>> ant
>>
>> this will create a lucene-core-4.5-SNAPSHOT.jar in build/core
>>
>> Thanks!
>>



Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

Posted by Tom Burton-West <tb...@umich.edu>.
Hi Robert,

Thanks for the fix.  CheckIndex finished within 24 hours, which is not
terrible given the size of this index (about a terabyte).

Tom

Opening index @ /htsolr/lss-dev/solrs/4.2/3/core/data/index

Segments file=segments_e numSegments=2 version=4.2.1 format=
userData={commitTimeMSec=1374712392103}
  1 of 2: name=_bch docCount=82946896
    codec=Lucene42
    compound=false
    numFiles=12
    size (MB)=752,005.689
    diagnostics = {timestamp=1374657630506, os=Linux,
os.version=2.6.18-348.12.1.el5, mergeFactor=16, source=merge,
lucene.version=4.2.1 1461071 - mark - 2013-03-26 08:23:34, os.arch=amd64,
mergeMaxNumSegments=2, java.version=1.6.0_16, java.vendor=Sun Microsystems
Inc.}
    no deletions
    test: open reader.........OK
    test: fields..............OK [12 fields]
    test: field norms.........OK [3 fields]
    test: terms, freq, prox...OK [2442919802 terms; 73922320413 terms/docs
pairs; 109976572432 tokens]
    test: stored fields.......OK [960417844 total field count; avg 11.579
fields per doc]
    test: term vectors........OK [81452262 total vector count; avg 1
term/freq vector fields per doc]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC;
0 SORTED; 0 SORTED_SET]

  2 of 2: name=_bcg docCount=42021835
    codec=Lucene42
    compound=false
    numFiles=12
    size (MB)=371,991.272
    diagnostics = {timestamp=1374612106174, os=Linux,
os.version=2.6.18-348.12.1.el5, mergeFactor=30, source=merge,
lucene.version=4.2.1 1461071 - mark - 2013-03-26 08:23:34, os.arch=amd64,
mergeMaxNumSegments=2, java.version=1.6.0_16, java.vendor=Sun Microsystems
Inc.}
    no deletions
    test: open reader.........OK
    test: fields..............OK [12 fields]
    test: field norms.........OK [3 fields]
    test: terms, freq, prox...OK [1435132736 terms; 36134595066 terms/docs
pairs; 53691487260 tokens]
    test: stored fields.......OK [483146935 total field count; avg 11.498
fields per doc]
    test: term vectors........OK [41299979 total vector count; avg 1
term/freq vector fields per doc]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC;
0 SORTED; 0 SORTED_SET]

No problems were detected with this index.




On Thu, Aug 8, 2013 at 11:24 AM, Robert Muir <rc...@gmail.com> wrote:

> On Thu, Aug 8, 2013 at 11:18 AM, Tom Burton-West <tb...@umich.edu>
> wrote:
> > Sure, I should be able to build a lucene-core jar and give it a try.  I
> > probably won't run it until tomorrow night though, because right now I'm
> > running some other tests on the machine I would run CheckIndex from, and
> > disk I/O (i.e. CheckIndex) would mess with the tests.
> >
> > Do I just need to check out revision 1511014 from branch_4x and build it?
> >
> >
>
> yes, something like:
>
> svn co -r 1511014
> https://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x
> cd branch_4x/lucene
> ant
>
> this will create a lucene-core-4.5-SNAPSHOT.jar in build/core
>
> Thanks!
>

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Aug 8, 2013 at 11:18 AM, Tom Burton-West <tb...@umich.edu> wrote:
> Sure, I should be able to build a lucene-core jar and give it a try.  I
> probably won't run it until tomorrow night though, because right now I'm
> running some other tests on the machine I would run CheckIndex from, and
> disk I/O (i.e. CheckIndex) would mess with the tests.
>
> Do I just need to check out revision 1511014 from branch_4x and build it?
>
>

yes, something like:

svn co -r 1511014 https://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x
cd branch_4x/lucene
ant

this will create a lucene-core-4.5-SNAPSHOT.jar in build/core
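
and then you can run CheckIndex with that jar on the classpath, something
like (the index path is a placeholder for whatever you were using before):

java -cp build/core/lucene-core-4.5-SNAPSHOT.jar \
     org.apache.lucene.index.CheckIndex /path/to/index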

Thanks!



Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

Posted by Tom Burton-West <tb...@umich.edu>.
Sure, I should be able to build a lucene-core jar and give it a try.  I
probably won't run it until tomorrow night though, because right now I'm
running some other tests on the machine I would run CheckIndex from, and
disk I/O (i.e. CheckIndex) would mess with the tests.

Do I just need to check out revision 1511014 from branch_4x and build it?

Tom


On Thu, Aug 8, 2013 at 10:51 AM, Robert Muir <rc...@gmail.com> wrote:

> Hi Tom, I committed a fix for the root cause
> (https://issues.apache.org/jira/browse/LUCENE-5156).
>
> Thanks for reporting this!
>
> I don't know if it's feasible for you to build a lucene-core.jar from
> branch_4x and run CheckIndex with that jar file to confirm it really
> addresses the issue: if this is possible in any way, it would be
> fantastic.
>
> There is nothing wrong with your index: it's just a code thing :)
>
> On Thu, Aug 8, 2013 at 10:45 AM, Tom Burton-West <tb...@umich.edu>
> wrote:
> > Hi Robert,
> >
> > I've been running CheckIndex for over a week and it is still working
> > through seekCeil()
> > (See below.)
> >
> > I'm going to kill the CheckIndex.  Admittedly, this index is an unusual
> > one, but at one point we were considering using MLT in our regular
> > index, which would result in a large term vectors file, although with
> > only about 800,000 docs per index.  Should we expect to see something
> > similar, or, with the two-orders-of-magnitude decrease in the number of
> > docs, might CheckIndex work a bit faster?
> >
> > Tom
> >
> >
> >
> > ---------------------------
> >
> > Started CheckIndex on Tuesday July 30 and it wrote the following to
> STDOUT:
> > Opening index @ /htsolr/lss-dev/solrs/4.2/3/core/data/index
> >
> > Segments file=segments_e numSegments=2 version=4.2.1 format=
> > userData={commitTimeMSec=1374712392103}
> >   1 of 2: name=_bch docCount=82946896
> >     codec=Lucene42
> >     compound=false
> >     numFiles=12
> >     size (MB)=752,005.689
> >     diagnostics = {timestamp=1374657630506, os=Linux,
> > os.version=2.6.18-348.12.1.el5, mergeFactor=16, source=merge,
> > lucene.version=4.2.1 1461071 - mark - 2013-03-26 08:23:34, os.arch=amd64,
> > mergeMaxNumSegments=2, java.version=1.6.0_16, java.vendor=Sun
> Microsystems
> > Inc.}
> >     no deletions
> >     test: open reader.........OK
> >     test: fields..............OK [12 fields]
> >     test: field norms.........OK [3 fields]
> >     test: terms, freq, prox...OK [2442919802 terms; 73922320413
> terms/docs
> > pairs; 109976572432 tokens]
> >     test: stored fields.......OK [960417844 total field count; avg 11.579
> > fields per doc]
> >     test: term vectors........[tburtonw@alamo 3]$
>

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

Posted by Robert Muir <rc...@gmail.com>.
Hi Tom, I committed a fix for the root cause
(https://issues.apache.org/jira/browse/LUCENE-5156).

Thanks for reporting this!

I don't know if it's feasible for you to build a lucene-core.jar from
branch_4x and run CheckIndex with that jar file to confirm it really
addresses the issue: if this is possible in any way, it would be
fantastic.

There is nothing wrong with your index: it's just a code thing :)

On Thu, Aug 8, 2013 at 10:45 AM, Tom Burton-West <tb...@umich.edu> wrote:
> Hi Robert,
>
> I've been running CheckIndex for over a week and it is still working
> through seekCeil()
> (See below.)
>
> I'm going to kill the CheckIndex.  Admittedly, this index is an unusual
> one, but at one point we were considering using MLT in our regular index,
> which would result in a large term vectors file, although with only about
> 800,000 docs per index.  Should we expect to see something similar, or,
> with the two-orders-of-magnitude decrease in the number of docs, might
> CheckIndex work a bit faster?
>
> Tom
>
>
>
> ---------------------------
>
> Started CheckIndex on Tuesday July 30 and it wrote the following to STDOUT:
> Opening index @ /htsolr/lss-dev/solrs/4.2/3/core/data/index
>
> Segments file=segments_e numSegments=2 version=4.2.1 format=
> userData={commitTimeMSec=1374712392103}
>   1 of 2: name=_bch docCount=82946896
>     codec=Lucene42
>     compound=false
>     numFiles=12
>     size (MB)=752,005.689
>     diagnostics = {timestamp=1374657630506, os=Linux,
> os.version=2.6.18-348.12.1.el5, mergeFactor=16, source=merge,
> lucene.version=4.2.1 1461071 - mark - 2013-03-26 08:23:34, os.arch=amd64,
> mergeMaxNumSegments=2, java.version=1.6.0_16, java.vendor=Sun Microsystems
> Inc.}
>     no deletions
>     test: open reader.........OK
>     test: fields..............OK [12 fields]
>     test: field norms.........OK [3 fields]
>     test: terms, freq, prox...OK [2442919802 terms; 73922320413 terms/docs
> pairs; 109976572432 tokens]
>     test: stored fields.......OK [960417844 total field count; avg 11.579
> fields per doc]
>     test: term vectors........[tburtonw@alamo 3]$



Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
Hi Tom,

I just noticed that you are on Linux with a 2.6 kernel.
Have you already enabled the -XX:+UseLargePages performance option, and is
it actually in use?
Solaris 9 has it on by default, but on Linux HugePages must be enabled first.

http://www.oracle.com/technetwork/java/javase/tech/largememory-jsp-137182.html
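
For example, roughly like this (the page count is only an illustration and
must be sized to your heap; 17000 x 2MB pages covers a 32GB heap with some
headroom):

sysctl -w vm.nr_hugepages=17000
java -Xmx32g -Xms32g -XX:+UseLargePages ...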

Just an idea.

Regards
Bernd


On 08.08.2013 16:45, Tom Burton-West wrote:
> Hi Robert,
> 
> I've been running CheckIndex for over a week and it is still working
> through seekCeil()
> (See below.)
> 
> I'm going to kill the CheckIndex.  Admittedly, this index is an unusual
> one, but at one point we were considering using MLT in our regular index,
> which would result in a large term vectors file, although with only about
> 800,000 docs per index.  Should we expect to see something similar, or,
> with the two-orders-of-magnitude decrease in the number of docs, might
> CheckIndex work a bit faster?
> 
> Tom
> 
> 
> 
> ---------------------------
> 
> Started CheckIndex on Tuesday July 30 and it wrote the following to STDOUT:
> Opening index @ /htsolr/lss-dev/solrs/4.2/3/core/data/index
> 
> Segments file=segments_e numSegments=2 version=4.2.1 format=
> userData={commitTimeMSec=1374712392103}
>   1 of 2: name=_bch docCount=82946896
>     codec=Lucene42
>     compound=false
>     numFiles=12
>     size (MB)=752,005.689
>     diagnostics = {timestamp=1374657630506, os=Linux,
> os.version=2.6.18-348.12.1.el5, mergeFactor=16, source=merge,
> lucene.version=4.2.1 1461071 - mark - 2013-03-26 08:23:34, os.arch=amd64,
> mergeMaxNumSegments=2, java.version=1.6.0_16, java.vendor=Sun Microsystems
> Inc.}
>     no deletions
>     test: open reader.........OK
>     test: fields..............OK [12 fields]
>     test: field norms.........OK [3 fields]
>     test: terms, freq, prox...OK [2442919802 terms; 73922320413 terms/docs
> pairs; 109976572432 tokens]
>     test: stored fields.......OK [960417844 total field count; avg 11.579
> fields per doc]
>     test: term vectors........[tburtonw@alamo 3]$
> 



Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

Posted by Tom Burton-West <tb...@umich.edu>.
Hi Robert,

I've been running CheckIndex for over a week and it is still working
through seekCeil()
(See below.)

I'm going to kill the CheckIndex.  Admittedly, this index is an unusual
one, but at one point we were considering using MLT in our regular index,
which would result in a large term vectors file, although with only about
800,000 docs per index.  Should we expect to see something similar, or,
with the two-orders-of-magnitude decrease in the number of docs, might
CheckIndex work a bit faster?

Tom



---------------------------

Started CheckIndex on Tuesday July 30 and it wrote the following to STDOUT:
Opening index @ /htsolr/lss-dev/solrs/4.2/3/core/data/index

Segments file=segments_e numSegments=2 version=4.2.1 format=
userData={commitTimeMSec=1374712392103}
  1 of 2: name=_bch docCount=82946896
    codec=Lucene42
    compound=false
    numFiles=12
    size (MB)=752,005.689
    diagnostics = {timestamp=1374657630506, os=Linux,
os.version=2.6.18-348.12.1.el5, mergeFactor=16, source=merge,
lucene.version=4.2.1 1461071 - mark - 2013-03-26 08:23:34, os.arch=amd64,
mergeMaxNumSegments=2, java.version=1.6.0_16, java.vendor=Sun Microsystems
Inc.}
    no deletions
    test: open reader.........OK
    test: fields..............OK [12 fields]
    test: field norms.........OK [3 fields]
    test: terms, freq, prox...OK [2442919802 terms; 73922320413 terms/docs
pairs; 109976572432 tokens]
    test: stored fields.......OK [960417844 total field count; avg 11.579
fields per doc]
    test: term vectors........[tburtonw@alamo 3]$

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

Posted by Robert Muir <rc...@gmail.com>.
Thanks, this is what I expected. I opened an issue to remove seek-by-ord
from this vectors format.
On Aug 2, 2013 2:13 PM, "Tom Burton-West" <tb...@umich.edu> wrote:

> Thanks Robert,
>
> Looks like it switches between seekCeil and seekExact:
>
> "main" prio=10 tid=0x000000000e79a000 nid=0x5fe5 runnable
> [0x00002b32de0cc000]
> jstack.out3-   java.lang.Thread.State: RUNNABLE
> jstack.out3-    at
>
> org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.seekCeil(CompressingTermVectorsReader.java:846)
> jstack.out3-    at
> org.apache.lucene.index.TermsEnum.seekCeil(TermsEnum.java:89)
> jstack.out3-    at
> org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1110)
> jstack.out3-    at
> org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1503)
> jstack.out3-    at
> org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:613)
> jstack.out3:    at
> org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1854)
> jstack.out3-
>
>
>
> "main" prio=10 tid=0x000000000e79a000 nid=0x5fe5 runnable
> [0x00002b32de0cc000]
>    java.lang.Thread.State: RUNNABLE
>         at
>
> org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.seekExact(CompressingTermVectorsReader.java:857)
>         at
> org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1103)
>         at
> org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1503)
>         at
> org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:613)
>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1854)
>
> I don't think highlighting is too slow (at least for our small indexes),
> but I will take a look at the PostingsHighlighter.
>
>
> Tom
>
> >
> >
> > Hi Tom: with this large term vector file, it's not really 343GB; as far
> > as CheckIndex is concerned, it's treated as 1000 343MB indexes (maybe
> > more, since they are also compressed), because each document's term
> > vector is like a little inverted index for the document. Each one is on
> > your large full-text field, so it has its own term dictionary and
> > "postings" (all those positions/offsets from your doc) to verify. It's
> > probably the case that term vectors with huge numbers of unique terms
> > aren't particularly optimized for your use-case either: for example, the
> > seekCeil() operation looks like a linear scan to me, and CheckIndex
> > tests term seeking if the TermsEnum supports ord (which it does). You
> > could probably use jstack to confirm some of this. Was highlighting with
> > vectors horribly slow? :)
> >
> > It's off-topic, but maybe something like PostingsHighlighter would be a
> > better fit for you, as it wouldn't duplicate the terms or positions,
> > just encode some offsets into the .pay file.
> >
> > Anyway, in my opinion, we should think about a JIRA issue such that if
> > you pass the -verbose flag to CheckIndex it prints some status
> > information about its progress. We could also think about trying to
> > improve seekCeil for term vector term dictionaries...
> >
>

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

Posted by Tom Burton-West <tb...@umich.edu>.
Thanks Robert,

Looks like it switches between seekCeil and seekExact:

"main" prio=10 tid=0x000000000e79a000 nid=0x5fe5 runnable
[0x00002b32de0cc000]
jstack.out3-   java.lang.Thread.State: RUNNABLE
jstack.out3-    at
org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.seekCeil(CompressingTermVectorsReader.java:846)
jstack.out3-    at
org.apache.lucene.index.TermsEnum.seekCeil(TermsEnum.java:89)
jstack.out3-    at
org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1110)
jstack.out3-    at
org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1503)
jstack.out3-    at
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:613)
jstack.out3:    at
org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1854)
jstack.out3-



"main" prio=10 tid=0x000000000e79a000 nid=0x5fe5 runnable
[0x00002b32de0cc000]
   java.lang.Thread.State: RUNNABLE
        at
org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.seekExact(CompressingTermVectorsReader.java:857)
        at
org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1103)
        at
org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1503)
        at
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:613)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1854)
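
(The traces above were captured with something like "jstack <pid> >>
jstack.out3", where <pid> is the CheckIndex JVM's process id.)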

I don't think highlighting is too slow (at least for our small indexes),
but I will take a look at the PostingsHighlighter.


Tom

>
>
> Hi Tom: with this large term vector file, it's not really 343GB; as far
> as CheckIndex is concerned, it's treated as 1000 343MB indexes (maybe
> more, since they are also compressed), because each document's term
> vector is like a little inverted index for the document. Each one is on
> your large full-text field, so it has its own term dictionary and
> "postings" (all those positions/offsets from your doc) to verify. It's
> probably the case that term vectors with huge numbers of unique terms
> aren't particularly optimized for your use-case either: for example, the
> seekCeil() operation looks like a linear scan to me, and CheckIndex tests
> term seeking if the TermsEnum supports ord (which it does). You could
> probably use jstack to confirm some of this. Was highlighting with
> vectors horribly slow? :)
>
> It's off-topic, but maybe something like PostingsHighlighter would be a
> better fit for you, as it wouldn't duplicate the terms or positions, just
> encode some offsets into the .pay file.
>
> Anyway, in my opinion, we should think about a JIRA issue such that if
> you pass the -verbose flag to CheckIndex it prints some status
> information about its progress. We could also think about trying to
> improve seekCeil for term vector term dictionaries...
>

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Aug 1, 2013 at 6:40 PM, Tom Burton-West <tb...@umich.edu> wrote:

> Hi all,
>
> OK, I really should have titled the post, "CheckIndex limit with large tvd
> files?"
>
> I started a new CheckIndex run about 1:00 pm on Tuesday and it seems to be
> stuck again looking at term vectors.
> I gave CheckIndex 32GB of memory, turned on GC logging, and echoed STDERR
> and STDOUT to a file.
>
> It seems stuck while testing term vectors, but maybe it just takes
> several days to test a term vector file that is 343GB.
>

Hi Tom: with this large term vector file, it's not really 343GB; as far
as CheckIndex is concerned, it's treated as 1000 343MB indexes (maybe more,
since they are also compressed), because each document's term vector is
like a little inverted index for the document. Each one is on your large
full-text field, so it has its own term dictionary and "postings" (all
those positions/offsets from your doc) to verify. It's probably the case
that term vectors with huge numbers of unique terms aren't particularly
optimized for your use-case either: for example, the seekCeil() operation
looks like a linear scan to me, and CheckIndex tests term seeking if the
TermsEnum supports ord (which it does). You could probably use jstack to
confirm some of this. Was highlighting with vectors horribly slow? :)
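
In code terms, here is a minimal sketch of the walk involved (Lucene 4.x
API; the index path is illustrative), just to show that every document's
vector is its own little term dictionary:

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class WalkTermVectors {
  public static void main(String[] args) throws Exception {
    DirectoryReader reader =
        DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
    for (int docID = 0; docID < reader.maxDoc(); docID++) {
      // one little "inverted index" per document
      Fields vectors = reader.getTermVectors(docID);
      if (vectors == null) continue;
      for (String field : vectors) {
        Terms terms = vectors.terms(field);
        TermsEnum te = terms.iterator(null); // 4.x signature: reuse arg
        BytesRef term;
        while ((term = te.next()) != null) {
          // each term carries its own freq/positions/offsets for this doc
        }
      }
    }
    reader.close();
  }
}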

It's off-topic, but maybe something like PostingsHighlighter would be a
better fit for you, as it wouldn't duplicate the terms or positions, just
encode some offsets into the .pay file.

Anyway, in my opinion, we should think about a JIRA issue such that if you
pass the -verbose flag to CheckIndex it prints some status information
about its progress. We could also think about trying to improve seekCeil
for term vector term dictionaries...