Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2008/11/18 11:33:44 UTC

[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

     [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1458:
---------------------------------------

    Attachment: LUCENE-1458.patch

> Further steps towards flexible indexing
> ---------------------------------------
>
>                 Key: LUCENE-1458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1458.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
>     uses tii/tis files, but the tii only stores term & long offset
>     (not a TermInfo).  At seek points, tis encodes term & freq/prox
>     offsets absolutely instead of with deltas.  Also, tis/tii
>     are structured by field, so we don't have to record field number
>     in every term.
>
>     On the first 1M docs of Wikipedia, the tii file is 36% smaller (0.99 MB
>     -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
>
>     RAM usage when loading terms dict index is significantly less
>     since we only load an array of offsets and an array of String (no
>     more TermInfo array).  It should be faster to init too.
>
>     This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
>     from docs/positions readers.  EG there is no more TermInfo used
>     when reading the new format.
>
>     There's nice symmetry now between reading & writing in the codec
>     chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
>     This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
>     terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
>     This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
>     old API on top of the new API to keep back-compat.
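As a rough illustration of the nesting described above, here is a toy in-memory walk over fields, then terms, then docs, then positions. It assumes nested sorted maps can stand in for the four enum levels; ToyIndex, add and walk are made-up names for this sketch and are not taken from the patch, whose actual method signatures are not shown in this message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy model only: hypothetical names, not the classes in LUCENE-1458. */
final class FlexWalkSketch {

    /** field -> term text -> doc ID -> positions, each level kept sorted. */
    static final class ToyIndex {
        final SortedMap<String, SortedMap<String, SortedMap<Integer, int[]>>> fields =
            new TreeMap<>();

        void add(String field, String term, int doc, int... positions) {
            fields.computeIfAbsent(field, f -> new TreeMap<>())
                  .computeIfAbsent(term, t -> new TreeMap<>())
                  .put(doc, positions);
        }
    }

    /** Visits one field at a time, so each term is plain text rather than a (field, text) pair. */
    static List<String> walk(ToyIndex index) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, SortedMap<String, SortedMap<Integer, int[]>>> field
                : index.fields.entrySet()) {
            for (Map.Entry<String, SortedMap<Integer, int[]>> term
                    : field.getValue().entrySet()) {
                for (Map.Entry<Integer, int[]> doc : term.getValue().entrySet()) {
                    out.add(field.getKey() + " " + term.getKey()
                        + " doc=" + doc.getKey()
                        + " positions=" + Arrays.toString(doc.getValue()));
                }
            }
        }
        return out;
    }
}
```

The point of the sketch is only the iteration order: the field is fixed at the outermost level, and everything beneath it is scoped to that field.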
>     
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>     fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
>     old API on top of new one, switch all core/contrib users to the
>     new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
>     DocsEnum, PostingsEnum -- this would give readers API flexibility
>     (not just index-file-format flexibility).  EG if someone wanted
>     to store payload at the term-doc level instead of
>     term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.
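To make the terms dict index layout from the first bullet concrete, a minimal sketch follows. It assumes the index is held as two parallel arrays (index terms and long offsets into the tis file), that lookup is a binary search for the last seek point at or before the target term, and that entries between seek points are stored as deltas from the previous entry. TermsIndexSketch, seekOffset and decodeBlock are hypothetical names, not classes from the patch.

```java
import java.util.Arrays;

/** Hypothetical sketch of a term + long-offset index; not the LUCENE-1458 code. */
final class TermsIndexSketch {
    private final String[] indexTerms; // sorted terms at the seek points
    private final long[] tisOffsets;   // file pointer into the tis file per seek point

    TermsIndexSketch(String[] indexTerms, long[] tisOffsets) {
        this.indexTerms = indexTerms;
        this.tisOffsets = tisOffsets;
    }

    /**
     * Returns the tis offset of the last seek point whose term is <= target,
     * or -1 if target sorts before every index term.
     */
    long seekOffset(String target) {
        int pos = Arrays.binarySearch(indexTerms, target);
        if (pos < 0) {
            pos = -pos - 2; // insertion point minus one
        }
        return pos < 0 ? -1 : tisOffsets[pos];
    }

    /**
     * Decodes one block that starts at a seek point: the first value is
     * absolute, every later value is a delta from the one before it.
     */
    static long[] decodeBlock(long[] encoded) {
        long[] out = new long[encoded.length];
        long running = 0;
        for (int i = 0; i < encoded.length; i++) {
            running = (i == 0) ? encoded[0] : running + encoded[i];
            out[i] = running;
        }
        return out;
    }
}
```

Because the first entry at a seek point is absolute, decoding can start there without any state carried over from earlier terms.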

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: [jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

Posted by Michael McCandless <lu...@mikemccandless.com>.

Ahh I see; you're right, TermEnum could make use of MutableString, or  
even a simple char[], for reuse.

But I wonder how much this'd help typical apps.  Ie, most queries  
spend nearly all of their time iterating through doc/position  
postings?  Looking up the term should be a smallish part of the cost.

RangeQuery, and populating the FieldCache, seem to be the parts
that'd gain from reuse.

At least one step forward in LUCENE-1458 is the TermsEnum now produces  
String instead of Term, since with the new API you interact with a  
field at a time.
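
A minimal sketch of the reuse idea discussed above, assuming a growable char[] that is overwritten for each term during a range scan instead of allocating a new String per term. TermBuffer, copyFrom and countInRange are hypothetical names for illustration only; they are not Lucene's or MG4J's API, and the String[] input merely stands in for terms decoded from the index.

```java
/** Hypothetical sketch of term-text reuse; none of these names come from Lucene or MG4J. */
final class ReusableTermSketch {

    /** Growable char buffer that is overwritten for each term. */
    static final class TermBuffer {
        char[] chars = new char[16];
        int length;

        /** Copies the text into the existing array, growing it only when needed. */
        void copyFrom(CharSequence text) {
            if (text.length() > chars.length) {
                chars = new char[Math.max(text.length(), chars.length * 2)];
            }
            for (int i = 0; i < text.length(); i++) {
                chars[i] = text.charAt(i);
            }
            length = text.length();
        }

        /** Lexicographic comparison by char value, like String.compareTo. */
        int compareTo(String other) {
            int n = Math.min(length, other.length());
            for (int i = 0; i < n; i++) {
                int diff = chars[i] - other.charAt(i);
                if (diff != 0) {
                    return diff;
                }
            }
            return length - other.length();
        }
    }

    /** Counts the terms in [lower, upper] using one buffer for the whole scan. */
    static int countInRange(String[] sortedTerms, String lower, String upper) {
        TermBuffer buf = new TermBuffer();
        int count = 0;
        for (String term : sortedTerms) {
            buf.copyFrom(term); // stands in for decoding the next term from the index
            if (buf.compareTo(lower) < 0) {
                continue;
            }
            if (buf.compareTo(upper) > 0) {
                break; // terms are sorted, so nothing later can match
            }
            count++;
        }
        return count;
    }
}
```

A scan like this touches many terms but few postings, which is why range-style enumeration is where buffer reuse would be expected to matter.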

Mike

Jason Rutherglen wrote:

> I was thinking of any API that returns strings like TermEnum  
> returning Term which contains the text string for example.  Unless  
> we're returning Tokens now instead of Terms?
>
> On Wed, Nov 19, 2008 at 1:30 PM, Michael McCandless <lucene@mikemccandless.com> wrote:
>
> MutableString looks cool but totally different from flexible indexing.
>
> Mike
>
>
> Jason Rutherglen wrote:
>
> On a side note, and I have not looked at the flexible indexing API  
> enough to know if there is some equivalent but are we moving to  
> something like MG4J's MutableString http://mg4j.dsi.unimi.it/docs/it/unimi/dsi/mg4j/util/MutableString.html 
>  instead of java.lang.String objects?
>

Re: [jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

Posted by Jason Rutherglen <ja...@gmail.com>.

I was thinking of any API that returns strings like TermEnum returning Term
which contains the text string for example.  Unless we're returning Tokens
now instead of Terms?

On Wed, Nov 19, 2008 at 1:30 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

>
> MutableString looks cool but totally different from flexible indexing.
>
> Mike
>
>
> Jason Rutherglen wrote:
>
>> On a side note, and I have not looked at the flexible indexing API enough
>> to know if there is some equivalent but are we moving to something like
>> MG4J's MutableString
>> http://mg4j.dsi.unimi.it/docs/it/unimi/dsi/mg4j/util/MutableString.html instead
>> of java.lang.String objects?
>>

Re: [jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

Posted by Michael McCandless <lu...@mikemccandless.com>.

MutableString looks cool but totally different from flexible indexing.

Mike

Jason Rutherglen wrote:

> On a side note, and I have not looked at the flexible indexing API  
> enough to know if there is some equivalent but are we moving to  
> something like MG4J's MutableString http://mg4j.dsi.unimi.it/docs/it/unimi/dsi/mg4j/util/MutableString.html 
>  instead of java.lang.String objects?
>

Re: [jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

Posted by Jason Rutherglen <ja...@gmail.com>.

On a side note, and I have not looked at the flexible indexing API enough to
know if there is some equivalent, but are we moving to something like MG4J's
MutableString
http://mg4j.dsi.unimi.it/docs/it/unimi/dsi/mg4j/util/MutableString.html instead
of java.lang.String objects?
