Posted to dev@lucene.apache.org by "Ferenczi Jim (JIRA)" <ji...@apache.org> on 2016/08/24 13:32:20 UTC

[jira] [Comment Edited] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

    [ https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434935#comment-15434935 ] 

Ferenczi Jim edited comment on LUCENE-7423 at 8/24/16 1:32 PM:
---------------------------------------------------------------

Another iteration. I fixed the prefix selection (the term "aa" should not increment the number of terms accounted for the term "a"). This reduces the index size greatly.
I've added a small benchmark, AutoPrefixPerf.java (modified from [~mikemccand]'s utils).

For the benchmark I used the English Wikipedia titles and a standard analyzer:

{panel:title=Standard analyzer|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
A single field in this test:
* "field": standard analyzer 

{noformat}
Indexed 12600000: 33.756 sec
Final Indexed 12696047: 33.9 sec
Optimize...
After force merge: 37.794 sec
Close...
After close: 37.798 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0 id=ex11gzoft89z21le5c93bpett
  1 of 1: name=_j maxDoc=12696047
    version=7.0.0
    id=ex11gzoft89z21le5c93bpets
    codec=Lucene62
    compound=false
    numFiles=7
    size (MB)=78.562
    diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, source=merge, mergeFactor=9, os.version=10.11.4, timestamp=1472043738648}
    no deletions
    test: open reader.........OK [took 0.002 sec]
    test: check integrity.....OK [took 0.046 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [1 fields] [took 0.000 sec]
    test: field norms.........OK [0 fields] [took 0.000 sec]
    test: terms, freq, prox...OK [2513966 terms; 34713220 terms/docs pairs; 0 tokens] [took 2.321 sec]
      field "field":
        index FST:
          699982 bytes
        terms:
          2513966 terms
          20843092 bytes (8.3 bytes/term)
        blocks:
          80953 blocks
          59384 terms-only blocks
          10 sub-block-only blocks
          21559 mixed blocks
          18273 floor blocks
          25611 non-floor blocks
          55342 floor sub-blocks
          13294379 term suffix bytes (164.2 suffix-bytes/block)
          2538232 term stats bytes (31.4 stats-bytes/block)
          8829391 other bytes (109.1 other-bytes/block)
          by prefix length:
             0: 5
             1: 421
             2: 5620
             3: 18794
             4: 31598
             5: 16630
             6: 5322
             7: 1709
             8: 443
             9: 138
            10: 249
            11: 14
            12: 2
            13: 6
            14: 2
      
    test: stored fields.......OK [0 total field count; avg 0.0 fields per doc] [took 0.257 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
    test: points..............OK [0 fields, 0 points] [took 0.000 sec]

detailed segment RAM usage: 
_j(7.0.0):C12696047: 741.9 KB
|-- postings [PerFieldPostings(segment=_j formats=1)]: 683.8 KB
    |-- format 'Lucene50_0' [BlockTreeTermsReader(fields=1,delegate=Lucene50PostingsReader(positions=false,payloads=false))]: 683.8 KB
        |-- field 'field' [BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]: 683.7 KB
            |-- term index [FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB
        |-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]: 32 bytes
|-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]: 58.1 KB
    |-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]: 58.1 KB
        |-- doc base deltas: 29.1 KB
        |-- start pointer deltas: 26.6 KB

No problems were detected with this index.
{noformat}
{panel}

{panel:title=EdgeNgram analyzer  min=2 max=5 |borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#FFFFCE}

Two fields for this test:
* "field": standard analyzer
* "field-edge": edge ngram analyzer (min=2, max=5) on top of a standard analyzer.
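
For readers unfamiliar with edge ngrams, here is a minimal sketch of what a min=2, max=5 setting is expected to emit for one token. It is plain Java and does not use Lucene's own filter; the class and method names are made up for illustration, and it works on UTF-16 chars rather than code points, which is an assumption that only holds for simple text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not the Lucene edge ngram filter.
final class EdgeNgramSketch {

    /** Returns the leading substrings of token with lengths minGram..maxGram. */
    static List<String> edgeNgrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        // A token shorter than minGram produces no gram at all.
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNgrams("lucene", 2, 5)); // [lu, luc, luce, lucen]
        System.out.println(edgeNgrams("a", 2, 5));      // []
    }
}
```

Every token of five or more characters adds four terms to the edge field under these settings, which is consistent with the much larger number of terms/docs pairs reported for this index.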

{noformat}
Indexed 12600000: 70.831 sec
Final Indexed 12696047: 71.484 sec
Optimize...
After force merge: 80.344 sec
Close...
After close: 80.347 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0 id=8bm8xy2peb5wo3td0ptgwv036
  1 of 1: name=_19 maxDoc=12696047
    version=7.0.0
    id=8bm8xy2peb5wo3td0ptgwv035
    codec=Lucene62
    compound=false
    numFiles=7
    size (MB)=224.803
    diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, source=merge, mergeFactor=15, os.version=10.11.4, timestamp=1472044255056}
    no deletions
    test: open reader.........OK [took 0.002 sec]
    test: check integrity.....OK [took 0.130 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [2 fields] [took 0.000 sec]
    test: field norms.........OK [0 fields] [took 0.000 sec]
    test: terms, freq, prox...OK [3459987 terms; 155467747 terms/docs pairs; 0 tokens] [took 3.736 sec]
      field "field":
        index FST:
          699967 bytes
        terms:
          2513966 terms
          20843092 bytes (8.3 bytes/term)
        blocks:
          80953 blocks
          59384 terms-only blocks
          10 sub-block-only blocks
          21559 mixed blocks
          18273 floor blocks
          25611 non-floor blocks
          55342 floor sub-blocks
          13294377 term suffix bytes (164.2 suffix-bytes/block)
          2538232 term stats bytes (31.4 stats-bytes/block)
          8836971 other bytes (109.2 other-bytes/block)
          by prefix length:
             0: 5
             1: 421
             2: 5620
             3: 18794
             4: 31598
             5: 16630
             6: 5322
             7: 1709
             8: 443
             9: 138
            10: 249
            11: 14
            12: 2
            13: 6
            14: 2
      
      field "field-edge":
        index FST:
          265903 bytes
        terms:
          946021 terms
          4693480 bytes (5.0 bytes/term)
        blocks:
          30830 blocks
          26448 terms-only blocks
          16 sub-block-only blocks
          4366 mixed blocks
          6054 floor blocks
          5852 non-floor blocks
          24978 floor sub-blocks
          2954296 term suffix bytes (95.8 suffix-bytes/block)
          990273 term stats bytes (32.1 stats-bytes/block)
          2750060 other bytes (89.2 other-bytes/block)
          by prefix length:
             0: 5
             1: 313
             2: 6051
             3: 21746
             4: 2272
             5: 396
             6: 28
             7: 16
             8: 3
      
    test: stored fields.......OK [0 total field count; avg 0.0 fields per doc] [took 0.319 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
    test: points..............OK [0 fields, 0 points] [took 0.000 sec]

detailed segment RAM usage: 
_19(7.0.0):C12696047: 1 MB
|-- postings [PerFieldPostings(segment=_19 formats=1)]: 943.6 KB
    |-- format 'Lucene50_0' [BlockTreeTermsReader(fields=2,delegate=Lucene50PostingsReader(positions=false,payloads=false))]: 943.6 KB
        |-- field 'field' [BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]: 683.7 KB
            |-- term index [FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB
        |-- field 'field-edge' [BlockTreeTerms(terms=946021,postings=120754527,positions=-1,docs=12645321)]: 259.8 KB
            |-- term index [FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 259.7 KB
        |-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]: 32 bytes
|-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]: 95.2 KB
    |-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]: 95.2 KB
        |-- doc base deltas: 47.5 KB
        |-- start pointer deltas: 45.3 KB

No problems were detected with this index.

Took 4.209 sec total.


Total index size: 235722542 bytes
{noformat}
{panel}

{panel:title=AutoPrefix minPrefixTerms=2|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
Two indexed fields:
* "field": standard analyzer
* "field-autoprefix": the autoprefix of the field "field" with a minPrefixTerms set to 2.
{noformat}
Indexed 12600000: 52.49 sec
Final Indexed 12696047: 52.717 sec
Optimize...
After force merge: 68.699 sec
Close...
After close: 68.704 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0 id=1gb0m3msddxzckhpfj9lzsneq
  1 of 1: name=_j maxDoc=12696047
    version=7.0.0
    id=1gb0m3msddxzckhpfj9lzsnep
    codec=Lucene62
    compound=false
    numFiles=7
    size (MB)=120.032
    diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, source=merge, mergeFactor=9, os.version=10.11.4, timestamp=1472044414055}
    no deletions
    test: open reader.........OK [took 0.002 sec]
    test: check integrity.....OK [took 0.067 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [2 fields] [took 0.000 sec]
    test: field norms.........OK [0 fields] [took 0.000 sec]
    test: terms, freq, prox...OK [3034551 terms; 60351742 terms/docs pairs; 0 tokens] [took 2.566 sec]
      field "field-autoprefix":
        index FST:
          152510 bytes
        terms:
          520585 terms
          3436438 bytes (6.6 bytes/term)
        blocks:
          16779 blocks
          12264 terms-only blocks
          1 sub-block-only blocks
          4514 mixed blocks
          3880 floor blocks
          5187 non-floor blocks
          11592 floor sub-blocks
          2140329 term suffix bytes (127.6 suffix-bytes/block)
          539804 term stats bytes (32.2 stats-bytes/block)
          729244 other bytes (43.5 other-bytes/block)
          by prefix length:
             0: 9
             1: 286
             2: 1746
             3: 6942
             4: 5237
             5: 1722
             6: 577
             7: 191
             8: 31
             9: 18
            10: 19
            11: 1
      
      field "field":
        index FST:
          699987 bytes
        terms:
          2513966 terms
          20843092 bytes (8.3 bytes/term)
        blocks:
          80953 blocks
          59384 terms-only blocks
          10 sub-block-only blocks
          21559 mixed blocks
          18273 floor blocks
          25611 non-floor blocks
          55342 floor sub-blocks
          13294384 term suffix bytes (164.2 suffix-bytes/block)
          2538232 term stats bytes (31.4 stats-bytes/block)
          8847612 other bytes (109.3 other-bytes/block)
          by prefix length:
             0: 5
             1: 421
             2: 5620
             3: 18794
             4: 31598
             5: 16630
             6: 5322
             7: 1709
             8: 443
             9: 138
            10: 249
            11: 14
            12: 2
            13: 6
            14: 2
      
    test: stored fields.......OK [0 total field count; avg 0.0 fields per doc] [took 0.281 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
    test: points..............OK [0 fields, 0 points] [took 0.000 sec]

detailed segment RAM usage: 
_j(7.0.0):C12696047: 894.8 KB
|-- postings [PerFieldPostings(segment=_j formats=1)]: 832.9 KB
    |-- format 'AutoPrefix_0' [BlockTreeTermsReader(fields=2,delegate=Lucene50PostingsReader(positions=false,payloads=false))]: 832.9 KB
        |-- field 'field' [BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]: 683.7 KB
            |-- term index [FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB
        |-- field 'field-autoprefix' [BlockTreeTerms(terms=520585,postings=25638522,positions=-1,docs=9493306)]: 149.1 KB
            |-- term index [FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 148.9 KB
        |-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]: 32 bytes
|-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]: 61.9 KB
    |-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]: 61.9 KB
        |-- doc base deltas: 30.5 KB
        |-- start pointer deltas: 29.1 KB

No problems were detected with this index.

Took 2.933 sec total.


Total index size: 125862986 bytes

{noformat}
{panel}

The autoprefix format compares favorably with the 2-5 edge ngram solution in this benchmark. It produces 520,585 terms, roughly half as many as the 2-5 edge ngram field (946,021 terms), is faster to index (52.717 sec vs 71.484 sec), and yields a smaller index (120 MB vs 225 MB).
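
The comparison can be re-derived from the CheckIndex output above. The following is only a small arithmetic sketch; the class and method names are made up, and the inputs are copied from the three runs (terms, indexing time in seconds, segment size in MB).

```java
import java.util.Locale;

// Hypothetical helper: recomputes the ratios from the numbers reported in the logs.
final class BenchmarkRatios {

    /** Ratio of the autoprefix figure to the edge ngram figure. */
    static double ratio(double autoPrefix, double edgeNgram) {
        return autoPrefix / edgeNgram;
    }

    /** Size added on top of the index that only has the standard field. */
    static double overhead(double totalMb, double standardOnlyMb) {
        return totalMb - standardOnlyMb;
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT, "terms ratio: %.2f", ratio(520585, 946021)));   // 0.55
        System.out.println(String.format(Locale.ROOT, "time ratio:  %.2f", ratio(52.717, 71.484)));   // 0.74
        System.out.println(String.format(Locale.ROOT, "size ratio:  %.2f", ratio(120.032, 224.803))); // 0.53
        // Extra space taken by the prefix field alone, relative to the 78.562 MB baseline.
        System.out.println(String.format(Locale.ROOT, "autoprefix overhead: %.3f MB", overhead(120.032, 78.562))); // 41.470
        System.out.println(String.format(Locale.ROOT, "edge ngram overhead: %.3f MB", overhead(224.803, 78.562))); // 146.241
    }
}
```

Measured against the standard-only index, the autoprefix field adds about 41 MB where the edge ngram field adds about 146 MB.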



> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7423
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7423
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/sandbox
>            Reporter: Ferenczi Jim
>            Priority: Minor
>         Attachments: LUCENE-7423.patch, LUCENE-7423.patch
>
>
> The autoprefix terms dict added in https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used to do efficient range queries but the replacement for prefix string queries is unclear. Edge ngrams could be used instead but they have a lot of drawbacks and are hard to configure correctly. The completion postings format is also a good replacement but it requires holding a big FST in RAM and it cannot be intersected with other fields. 
> This patch is a proposal for a new PostingsFormat optimized for prefix queries on string fields. It detects prefixes that match "enough" terms and writes auto-prefix terms into their own virtual field.
> At search time the virtual field is used to speed up prefix queries that match "enough" terms.
> The auto-prefix terms are built in two passes:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted, the prefixes are flushed on the fly depending on the input. For each prefix we build its corresponding inverted list using a DocIdSetBuilder. The first pass visits each term of the field's TermsEnum only once. When a prefix is flushed from the prefix tree, its inverted list is dumped into a temporary file for further use. This is necessary since the prefixes are not sorted when they are removed from the tree. The selected auto prefixes are sorted at the end of the first pass.
> * The second pass is a sorted scan of the prefixes and the temporary file is used to read the corresponding inverted lists.
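
The two passes above can be sketched in plain Java. This is not the code from the patch, only an illustration under assumptions: a java.util.BitSet stands in for DocIdSetBuilder, an in-memory list stands in for the temporary file, terms are Java strings rather than bytes, and a prefix is kept when at least minPrefixTerms terms start with it (the counting rule in the patch may differ). All names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of the two-pass construction; not the patch itself.
final class TwoPassAutoPrefixSketch {

    /** An open node on the path from the root of the prefix tree to the current term. */
    private static final class Node {
        final String prefix;
        final BitSet docs = new BitSet();
        int termCount;
        Node(String prefix) { this.prefix = prefix; }
    }

    /** Selected prefixes in flush order; entry i pairs with tempFile.get(i). */
    final List<String> flushOrder = new ArrayList<>();
    /** Stand-in for the temporary file: one inverted list per selected prefix. */
    final List<BitSet> tempFile = new ArrayList<>();

    /** Dumps a closed node if it matches enough terms, then folds it into its parent. */
    private void flush(Node node, Deque<Node> stack, int minPrefixTerms) {
        if (node.termCount >= minPrefixTerms) {
            flushOrder.add(node.prefix);
            tempFile.add(node.docs);
        }
        Node parent = stack.peek();
        if (parent != null) {
            parent.termCount += node.termCount;
            parent.docs.or(node.docs);
        }
    }

    /** Pass 1: a single scan over the sorted terms, flushing prefixes on the fly. */
    void firstPass(SortedMap<String, BitSet> terms, int minPrefixTerms) {
        Deque<Node> stack = new ArrayDeque<>();
        for (Map.Entry<String, BitSet> entry : terms.entrySet()) {
            String term = entry.getKey();
            if (term.isEmpty()) continue; // the empty term has no prefix to account for
            // Close every open prefix that the new term does not share.
            while (!stack.isEmpty() && !term.startsWith(stack.peek().prefix)) {
                flush(stack.pop(), stack, minPrefixTerms);
            }
            // Open the remaining prefixes of this term, down to the term itself.
            int depth = stack.isEmpty() ? 0 : stack.peek().prefix.length();
            for (int len = depth + 1; len <= term.length(); len++) {
                stack.push(new Node(term.substring(0, len)));
            }
            Node leaf = stack.peek();
            leaf.termCount++;
            leaf.docs.or(entry.getValue());
        }
        while (!stack.isEmpty()) {
            flush(stack.pop(), stack, minPrefixTerms);
        }
    }

    /** Pass 2: sorted scan of the selected prefixes, reading lists back from the temp store. */
    Map<String, BitSet> secondPass() {
        List<Integer> slots = new ArrayList<>();
        for (int i = 0; i < flushOrder.size(); i++) slots.add(i);
        slots.sort((a, b) -> flushOrder.get(a).compareTo(flushOrder.get(b)));
        Map<String, BitSet> sorted = new LinkedHashMap<>();
        for (int slot : slots) sorted.put(flushOrder.get(slot), tempFile.get(slot));
        return sorted;
    }

    static BitSet docs(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) bits.set(id);
        return bits;
    }

    public static void main(String[] args) {
        SortedMap<String, BitSet> terms = new TreeMap<>();
        terms.put("aa", docs(0));
        terms.put("ab", docs(1));
        terms.put("abc", docs(2));
        terms.put("b", docs(3));
        TwoPassAutoPrefixSketch sketch = new TwoPassAutoPrefixSketch();
        sketch.firstPass(terms, 2);
        System.out.println(sketch.flushOrder);   // [ab, a]
        System.out.println(sketch.secondPass()); // {a={0, 1, 2}, ab={1, 2}}
    }
}
```

In the example "ab" leaves the tree before "a", so the flush order is not sorted; this is the situation that makes the temporary store and the final sort necessary.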
> The patch is just a POC and there is room for optimization, but the first results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with the KeywordAnalyzer and compared the index/merge time and the size of the indices. 
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 572M on disk and it took 130s to index and optimize the 11M titles. 
> The auto prefix index takes 287M on disk and took 70s to index and optimize the same 11M titles. Among the 287M, only 170M are used for the auto prefix fields and the rest is for the regular keyword field. All the auto prefixes were generated for this test (at least 2 terms per auto-prefix).  
> The queries have similar performance since we are sure on both sides that one inverted list can answer any prefix query.
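
To illustrate why a single inverted list can answer a prefix query, here is a minimal search-time sketch. How the patch actually routes queries is not shown in this message, so everything below is an assumption: the names are hypothetical, java.util.BitSet stands in for the inverted lists, and the fallback scan over the regular terms is just one plausible way to handle prefixes that were not selected.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of a prefix lookup; not the query code from the patch.
final class PrefixLookupSketch {

    static BitSet prefixQuery(String prefix,
                              SortedMap<String, BitSet> terms,
                              Map<String, BitSet> autoPrefixes) {
        // Fast path: the prefix was selected at index time, one list answers the query.
        BitSet precomputed = autoPrefixes.get(prefix);
        if (precomputed != null) {
            return (BitSet) precomputed.clone();
        }
        // Assumed fallback: union the lists of the few terms that share the prefix.
        BitSet result = new BitSet();
        for (Map.Entry<String, BitSet> entry : terms.tailMap(prefix).entrySet()) {
            if (!entry.getKey().startsWith(prefix)) break; // matching terms are contiguous
            result.or(entry.getValue());
        }
        return result;
    }

    static BitSet docs(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) bits.set(id);
        return bits;
    }

    public static void main(String[] args) {
        SortedMap<String, BitSet> terms = new TreeMap<>();
        terms.put("aa", docs(0));
        terms.put("ab", docs(1));
        terms.put("abc", docs(2));
        terms.put("b", docs(3));
        Map<String, BitSet> autoPrefixes = new HashMap<>();
        autoPrefixes.put("a", docs(0, 1, 2));
        autoPrefixes.put("ab", docs(1, 2));
        System.out.println(prefixQuery("a", terms, autoPrefixes)); // {0, 1, 2}
        System.out.println(prefixQuery("b", terms, autoPrefixes)); // {3}
    }
}
```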


