Posted to dev@lucene.apache.org by "Michael McCandless (Created) (JIRA)" <ji...@apache.org> on 2012/03/20 17:15:39 UTC

[jira] [Created] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
-------------------------------------------------------------------------------------

                 Key: LUCENE-3892
                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Michael McCandless
             Fix For: 4.0


On the flex branch we explored a number of possible intblock
encodings, but for whatever reason never brought them to completion.
There are still a number of issues opened with patches in different
states.

Initial results (based on prototype) were excellent (see
http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
).

I think this would make a good GSoC project.
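
For readers unfamiliar with the encodings named in the title, here is a minimal sketch of FOR-style block packing, assuming a simplified variant in which every value in a block is stored with the same bit width and no reference value is subtracted. The class name, method names, and the long[] layout are hypothetical and are not taken from any of the patches on this issue. PFOR differs in that it may pick a smaller width and store the outliers separately as exceptions.

```java
final class ForBlockSketch {
    /** Number of bits needed for the largest value in the block (at least 1). */
    static int bitsRequired(int[] values) {
        int or = 0;
        for (int v : values) {
            if (v < 0) {
                throw new IllegalArgumentException("values must be non-negative");
            }
            or |= v;
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    }

    /** Packs the values back to back, numBits each; assumes every value fits in numBits. */
    static long[] encode(int[] values, int numBits) {
        long[] packed = new long[(values.length * numBits + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * numBits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            packed[word] |= ((long) values[i]) << shift;
            if (shift + numBits > 64) {
                // The value straddles two words: spill its high bits into the next one.
                packed[word + 1] |= ((long) values[i]) >>> (64 - shift);
            }
        }
        return packed;
    }

    /** Reverses encode(): reads count values of numBits each. */
    static int[] decode(long[] packed, int numBits, int count) {
        int[] out = new int[count];
        long mask = (1L << numBits) - 1;
        for (int i = 0; i < count; i++) {
            long bitPos = (long) i * numBits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[word] >>> shift;
            if (shift + numBits > 64) {
                v |= packed[word + 1] << (64 - shift);
            }
            out[i] = (int) (v & mask);
        }
        return out;
    }
}
```

A real implementation would typically use unrolled, per-width decoders rather than this generic loop, since decode speed is the whole point of these formats.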


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Issue Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247175#comment-13247175 ] 

Han Jiang edited comment on LUCENE-3892 at 4/5/12 12:05 PM:
------------------------------------------------------------

Hi Mike,
I have changed my proposal a bit, but I still have some questions:

{quote}
* There are actually more than 2 codecs (eg we also have Lucene3x,
SimpleText, sep/intblock (abstract), random codecs/postings
formats for testing...), but our default codec now is Lucene40.
{quote}

Yes, but it seems that our baseline will be Lucene40 and Pulsing? Lucene3x is read-only, and the other approaches are not used in production.
Also, what is a random codec? Does it mean a codec is picked at random for the user?

{quote}
* I think you can use the existing abstract sep/intblock classes
(ie, they implement layers like FieldsProducer/Consumer...), and
then you can "just" implement the required methods (eg to
encode/decode one int[] block).
{quote}

And this was my initial thought about the PForDelta interface:

The class hierarchy will be as below (quite similar to pulsing):
* PForDeltaPostingsFormat (extends PostingsFormat):
    It will define global behaviors such as the file suffix, and provide customized FieldsWriter/Reader.
* PForDeltaFieldsWriter (extends FieldsConsumer):
    It will define how terms, docids, freqs, and offsets are written into postings files.
    Inner classes include:
** PForDeltaTermsConsumer (extends TermsConsumer)
** PForDeltaPostingsConsumer (extends PostingsConsumer)
* PForDeltaFieldsReader (extends FieldsProducer):
    It will define how postings are read from the index, and provide *Enum classes to iterate docids, freqs, etc.
    Inner classes include:
** PForDeltaFieldsEnum (extends FieldsEnum)
** PForDeltaTermsEnum (extends TermsEnum)
** PForDeltaDocsEnum (extends DocsEnum)
** PForDeltaDocsAndPositionsEnum (extends DocsAndPositionsEnum)
** PForDeltaTerms (extends Terms)

It seems that "BlockTermsReader/Writer" already implement those subclasses, so we can just pass our Postings(Writer/Reader)Base as an argument, as PatchedFrameOfRefCodec::fieldsConsumer() does.
Then, to introduce PForDeltaCodec into trunk, should we also introduce the "fixed codec"? Also, why isn't lucene40codec implemented along these lines?
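
The idea quoted above, of reusing shared layers and "just" implementing the encode/decode of one int[] block, can be pictured roughly as follows. This is only a sketch: IntBlockCodecSketch, RawIntBlockCodec, and DocDeltas are hypothetical names, not the actual sep/intblock API, and the raw 4-bytes-per-value codec stands in for where a PForDelta encoder would go.

```java
import java.nio.ByteBuffer;

/** Hypothetical per-block contract; the shared layers above it are not shown. */
interface IntBlockCodecSketch {
    byte[] encodeBlock(int[] block);

    void decodeBlock(byte[] data, int[] block);
}

/** Trivial baseline: 4 bytes per value, big-endian, no compression. */
final class RawIntBlockCodec implements IntBlockCodecSketch {
    @Override
    public byte[] encodeBlock(int[] block) {
        ByteBuffer buf = ByteBuffer.allocate(block.length * 4);
        buf.asIntBuffer().put(block);
        return buf.array();
    }

    @Override
    public void decodeBlock(byte[] data, int[] block) {
        ByteBuffer.wrap(data).asIntBuffer().get(block);
    }
}

/** The "Delta" part: ascending doc IDs become small gaps before block encoding. */
final class DocDeltas {
    /** Replaces ascending doc IDs with gaps, in place; the first gap is measured from 0. */
    static void toDeltas(int[] docIds) {
        int prev = 0;
        for (int i = 0; i < docIds.length; i++) {
            int doc = docIds[i];
            docIds[i] = doc - prev;
            prev = doc;
        }
    }

    /** Inverse of toDeltas(): a running sum restores the absolute doc IDs, in place. */
    static void fromDeltas(int[] deltas) {
        int doc = 0;
        for (int i = 0; i < deltas.length; i++) {
            doc += deltas[i];
            deltas[i] = doc;
        }
    }
}
```

Swapping RawIntBlockCodec for a compressing implementation would leave the callers unchanged, which is the appeal of building on the existing abstract classes.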

{quote}
* We may need to tune the skipper settings, based on profiling
results from skip-intensive (Phrase, And) queries... since it's
currently geared towards single-doc-at-once encoding. I don't think
we should try to make a new skipper impl here... (there is a separate
issue for that).
{quote}

I haven't looked much into the different kinds of queries. What are the skipper settings?

{quote}
* Maybe explore the combination of pulsing and PForDelta codecs;
seems like the combination of those two could be important, since
for low docFreq terms, retrieving the docs is now more
expensive...
{quote}

Yes, it seems that if PForDelta outperforms current approaches, a Pulsing version will work better. This feature will also come as "phase 2".
                
                  
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287936#comment-13287936 ] 

Michael McCandless commented on LUCENE-3892:
--------------------------------------------

Awesome progress!  Nice to have a dirt path online that we can then
iterate from ...

Hmm, I'm seeing some test failures when I run:
{noformat}
ant test -Dtests.postingsformat=PFor
{noformat}
Eg, TestNRTThreads, TestShardSearching, TestTimeLimitingCollector.

Remember to add the standard copyright headers to each new source
file...

We don't have to do this now, but I wonder if we can share code w/ the
packed ints impl we have, instead of generating another one with the .py
source.

TestDemo makes a nice TestMin... I usually start with TestDemo when
testing scary new code, and then it's a huge milestone once TestDemo
passes :)

We should definitely cutover to BlockTree terms dict (I would upgrade
that TODO to a nocommit!).

I suspect that wrapping the block's byte[] as ByteBuffer and then
IntBuffer is going to be too costly per decode so we should init them
once and re-use (upgrade that TODO to a nocommit).
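
A minimal sketch of the init-once-and-reuse idea, assuming the decoder refills one fixed byte[] per block. ReusableBlockBuffer is a hypothetical name, not a class from the patch; only the java.nio calls are real API.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

final class ReusableBlockBuffer {
    private final byte[] bytes;
    private final IntBuffer ints; // view over `bytes`, created once in the constructor

    ReusableBlockBuffer(int maxInts) {
        bytes = new byte[maxInts * 4];
        // ByteBuffer.wrap is big-endian by default; the view shares the array's content.
        ints = ByteBuffer.wrap(bytes).asIntBuffer();
    }

    /** The array the caller refills with raw block bytes before each decode. */
    byte[] bytes() {
        return bytes;
    }

    /** Copies the first `count` ints of the current block without allocating buffers. */
    void readInts(int[] dest, int count) {
        ints.rewind();
        ints.get(dest, 0, count);
    }
}
```

Because the IntBuffer is a view, writing new bytes into the array is enough; each decode only has to rewind the view.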

                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396325#comment-13396325 ] 

Michael McCandless commented on LUCENE-3892:
--------------------------------------------

On the For patch ... we shouldn't encode/decode numInts right?  It's
always 128?

Up above, in ForFactory, when we readInt() to get numBytes ... it
seems like we could stuff the header numBits into that same int and
save checking that in FORUtil.decompress....
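
One possible way to stuff numBits into the same int as numBytes is sketched below. The layout (low 6 bits for numBits, upper 26 bits for numBytes) and the class name BlockHeaderSketch are assumptions made for illustration, not what ForFactory or FORUtil actually do.

```java
final class BlockHeaderSketch {
    private static final int BITS_FIELD_WIDTH = 6; // enough to hold 0..32
    private static final int BITS_MASK = (1 << BITS_FIELD_WIDTH) - 1;
    private static final int MAX_NUM_BYTES = (1 << (32 - BITS_FIELD_WIDTH)) - 1;

    /** Combines the block's byte length and bit width into a single int. */
    static int pack(int numBytes, int numBits) {
        if (numBits < 0 || numBits > 32) {
            throw new IllegalArgumentException("numBits out of range: " + numBits);
        }
        if (numBytes < 0 || numBytes > MAX_NUM_BYTES) {
            throw new IllegalArgumentException("numBytes out of range: " + numBytes);
        }
        return (numBytes << BITS_FIELD_WIDTH) | numBits;
    }

    /** Unsigned shift, so large numBytes values that set the sign bit still decode. */
    static int numBytes(int header) {
        return header >>> BITS_FIELD_WIDTH;
    }

    static int numBits(int header) {
        return header & BITS_MASK;
    }
}
```

With this layout a single readInt() yields both fields, so the decoder would not need to look inside the block for its bit width.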

I think there are a few possible ideas to explore to get faster
PFor/For performance:

  * Get more direct access to the file as an int[]; eg MMapDir could
    expose an IntBuffer from its ByteBuffer (saving the initial copy
    into byte[] that we now do).  Or maybe we add
    IndexInput.readInts(int[]) and dir impl can optimize how that's
    done (MMapDir could use Unsafe.copyBytes... except for little
    endian architectures ... we'd probably have to have separate
    specialized decoder rather than letting Int/ByteBuffer do the byte
    swapping).  This would require the whole file stays aligned w/ int
    (eg the header must be 0 mod 4).

  * Copy/share how oal.packed works, i.e. being able to waste a bit to
    have faster decode (eg storing the 7 bit case as byte[], wasting 1
    bit for each value).

  * Skipping: can we partially decode a block?  EG if we are skipping
    and we know we only want values after the 80th one, then we
    shouldn't decode those first 80...

  * Since doc/freq are "aligned", when we store pointers to a given
    spot, eg in the terms dict or in skip data, we should only store
    the offset once (today we store it twice).

  * Alternatively, maybe we should only save skip data on doc/freq
    block boundaries (prox would still need skip-within-block).

  * Maybe we should store doc & frq blocks interleaved in a single
    file (since they are "aligned") and then skip would skip to the
    start of a doc/frq block pair.

Other ideas...?
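
To illustrate two of the ideas above together (wasting a bit for a faster decode, and partially decoding a block when skipping), here is a small sketch. It assumes values that fit in 7 bits are stored one per byte; ByteAlignedBlockSketch and its methods are hypothetical names, not oal.packed API.

```java
final class ByteAlignedBlockSketch {
    /** Stores values that fit in 7 bits one per byte, wasting 1 bit per value. */
    static byte[] encode7(int[] values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            if (values[i] < 0 || values[i] > 127) {
                throw new IllegalArgumentException("value does not fit in 7 bits: " + values[i]);
            }
            out[i] = (byte) values[i];
        }
        return out;
    }

    /** Decodes only positions from..end; earlier entries of dest are left untouched. */
    static void decodeFrom(byte[] packed, int from, int[] dest) {
        for (int i = from; i < packed.length; i++) {
            dest[i] = packed[i]; // values are 0..127, so widening cannot sign-extend
        }
    }

    /** Size in bytes if the same values were tightly bit-packed instead. */
    static int exactPackedBytes(int count, int numBits) {
        return (count * numBits + 7) / 8;
    }
}
```

For a 128-value block at 7 bits this trades 128 bytes against 112 tightly packed bytes, and in exchange starting the decode at the 80th value becomes a plain array offset.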

                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Issue Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240403#comment-13240403 ] 

Han Jiang edited comment on LUCENE-3892 at 3/28/12 1:50 PM:
------------------------------------------------------------

Hi, 
I have submitted my proposal. Comments are welcome!
Also, I made it public: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/billybob/1
                


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399883#comment-13399883 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

Yes, really interesting, and that makes sense. As far as I know, a method with exception handling can be much slower than a simple if-statement check. Here is part of the results from my test, with Mike's patch:
{noformat}
           OrHighMed        2.53        0.31        2.57        0.13  -13% -   21%
            Wildcard        3.86        0.12        3.94        0.38  -10% -   15%
          OrHighHigh        1.57        0.18        1.61        0.08  -12% -   21%
      TermBGroup1M1P        1.93        0.03        2.48        0.10   21% -   35%
         TermGroup1M        1.37        0.02        1.81        0.05   26% -   37%
        TermBGroup1M        1.17        0.02        1.64        0.07   32% -   47%
                Term        2.92        0.13        4.46        0.23   38% -   68%
{noformat}
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3892:
------------------------------

    Attachment: LUCENE-3892_for_int[].patch

For the decompression phase, this patch replaces the use of IntBuffer with a direct int[] to int[] decoder. The convert() method should be performant enough, since it is no different from the inner implementation of IntBuffer.get() (see http://massapi.com/source/jdk1.6.0_17/src/java/nio/Bits.java.html, line 193). However, the results aren't interesting.
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for.patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245374#comment-13245374 ] 

Michael McCandless commented on LUCENE-3892:
--------------------------------------------

The proposal at
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/billybob/1
looks great!  Some initial feedback:

  * There are actually more than 2 codecs (eg we also have Lucene3x,
    SimpleText, sep/intblock (abstract), random codecs/postings
    formats for testing...), but our default codec now is Lucene40.

  * I think you can use the existing abstract sep/intblock classes
    (ie, they implement layers like FieldsProducer/Consumer...), and
    then you can "just" implement the required methods (eg to
    encode/decode one int[] block).

  * We may need to tune the skipper settings, based on profiling
    results from skip-intensive (Phrase, And) queries... since it's
    currently geared towards single-doc-at-once encoding.  I don't think
    we should try to make a new skipper impl here... (there is a separate
    issue for that).

  * Maybe explore the combination of pulsing and PForDelta codecs;
    seems like the combination of those two could be important, since
    for low docFreq terms, retrieving the docs is now more
    expensive...

                


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262715#comment-13262715 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

Thank you, Robert! However, the Maven mirror in China (http://mirrors.redv.com/maven2) is currently unavailable. Can we pass a property to Ivy to replace the "repo1*" stuff?
                


[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3892:
---------------------------------------

    Attachment: LUCENE-3892-BlockTermScorer.patch

I was curious how much the "layers" (SepPostingsReader,
FixedIntBlock.IntIndexInput, ForFactory) between the FOR block decode
and the query scoring were hurting performance, so I wrote a
specialized scorer (BlockTermScorer) for just TermQuery.

The scorer is only used if the postings format is ForPF, and if no
skipping will be done (I didn't implement advance...).

The scorer reaches down and holds on to the decoded int[] buffer, and
then does its own adding up of the doc deltas, reading the next block,
etc.
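
The core loop of such a scorer might look roughly like the sketch below: it holds the decoded int[] buffer itself, sums the doc deltas, and pulls the next block when the buffer runs out. DeltaBlockSource and BlockDocIteratorSketch are hypothetical names used for illustration, not code from the BlockTermScorer patch, and scoring and freqs are left out.

```java
/** Hypothetical source of decoded blocks of doc deltas. */
interface DeltaBlockSource {
    /** Decodes the next block into buffer; returns the number of entries, or 0 when exhausted. */
    int nextBlock(int[] buffer);
}

final class BlockDocIteratorSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final DeltaBlockSource source;
    private final int[] buffer; // decoded doc deltas of the current block
    private int upto;           // next position to consume in buffer
    private int limit;          // number of valid entries in buffer
    private int doc;            // running sum of deltas, i.e. the last returned doc ID

    BlockDocIteratorSketch(DeltaBlockSource source, int blockSize) {
        this.source = source;
        this.buffer = new int[blockSize];
    }

    /** Returns the next absolute doc ID, or NO_MORE_DOCS once all blocks are consumed. */
    int nextDoc() {
        if (upto == limit) {
            limit = source.nextBlock(buffer);
            upto = 0;
            if (limit == 0) {
                return NO_MORE_DOCS;
            }
        }
        doc += buffer[upto++];
        return doc;
    }
}
```

There is no advance() here, matching the restriction above that the specialized scorer is only used when no skipping is needed.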

The baseline is the current branch (not trunk!):

{noformat}
                Task    QPS base StdDev base   QPS patch StdDev patch     Pct diff
            Wildcard       10.31        0.40       10.10        0.17   -7% -    3%
         AndHighHigh        4.90        0.10        4.82        0.15   -6% -    3%
             Prefix3       28.50        1.06       28.11        0.50   -6% -    4%
              IntNRQ        9.72        0.46        9.60        0.57  -11% -    9%
        SloppyPhrase        0.92        0.03        0.92        0.02   -6% -    5%
            PKLookup      106.21        2.54      105.66        2.07   -4% -    3%
              Phrase        1.56        0.00        1.56        0.01   -1% -    0%
              Fuzzy1       90.33        3.48       90.19        2.25   -6% -    6%
              Fuzzy2       29.66        0.61       29.64        0.85   -4% -    4%
          AndHighMed       14.87        0.29       15.02        0.81   -6% -    8%
             Respell       78.83        2.46       79.62        1.54   -3% -    6%
            SpanNear        1.18        0.02        1.19        0.04   -4% -    6%
         TermGroup1M        2.78        0.06        3.28        0.14   10% -   25%
          OrHighHigh        4.19        0.24        5.04        0.20    9% -   32%
           OrHighMed        8.21        0.45        9.87        0.23   11% -   30%
      TermBGroup1M1P        5.11        0.20        6.21        0.26   12% -   31%
        TermBGroup1M        4.49        0.11        5.49        0.27   13% -   31%
                Term        8.89        0.58       11.90        1.52    9% -   61%
{noformat}

Seems like we get a good boost removing the abstractions.

                


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260422#comment-13260422 ] 

Michael McCandless commented on LUCENE-3892:
--------------------------------------------

Hi Billy, I'm very excited your proposal is accepted!  Congrats :)  Now the fun work begins...
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262707#comment-13262707 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

Yes, and "ant test" is running now. Maybe we can configure something to avoid the ugly hack?
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263302#comment-13263302 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

OK, and thanks for the new commit!
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3892_settings.patch, LUCENE-3892_settings.patch


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289296#comment-13289296 ] 

Michael McCandless commented on LUCENE-3892:
--------------------------------------------

Hi Billy,

bq. Can I get it from a wiki dump instead?

You can download it at http://people.apache.org/~mikemccand/enwiki-20120502-lines-1k.txt.lzma

That's ~6.3 GB (compressed) and 28.7 GB (decompressed); it's the 2012/05/02 Wikipedia en export, filtered to plain text and then broken into 33.3 M ~1 KB sized docs.  I can help you get the luceneutil env set up...

{quote}
bq. Indexing time is ~18% slower than Lucene40PostingsFormat (1071 sec vs 1261 sec).

Yes, that is expected: it actually scans every block 33 times to estimate metadata such as numFrameBits and numExceptions.
{quote}

OK, in that case I'm surprised it's only ~18% slower!
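
A minimal sketch of why 33 scans per block would be expensive, assuming the estimate tries every candidate bit width from 0 to 32 and counts the values that do not fit. The class name, the Estimate type and the cost model (32 extra bits per exception) are hypothetical and are not taken from the patch.

```java
final class PForEstimateSketch {
    /** Outcome of the scan: chosen width, exceptions at that width, estimated size in bits. */
    static final class Estimate {
        final int numFrameBits;
        final int numExceptions;
        final long estimatedBits;

        Estimate(int numFrameBits, int numExceptions, long estimatedBits) {
            this.numFrameBits = numFrameBits;
            this.numExceptions = numExceptions;
            this.estimatedBits = estimatedBits;
        }
    }

    // Assumed cost model: every exception costs one extra 32-bit int.
    static final int EXCEPTION_COST_BITS = 32;

    /** Scans the block once per candidate width (0..32, i.e. 33 passes). Values must be non-negative. */
    static Estimate estimate(int[] block) {
        Estimate best = null;
        for (int bits = 0; bits <= 32; bits++) {
            int exceptions = 0;
            for (int v : block) {
                // Java masks int shift counts to 5 bits, so width 32 needs its own guard.
                if (bits < 32 && (v >>> bits) != 0) {
                    exceptions++;
                }
            }
            long size = (long) bits * block.length + (long) exceptions * EXCEPTION_COST_BITS;
            if (best == null || size < best.estimatedBits) {
                best = new Estimate(bits, exceptions, size);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] block = new int[128];
        java.util.Arrays.fill(block, 3);
        block[17] = 1000; // one outlier that becomes an exception
        Estimate e = estimate(block);
        System.out.println(e.numFrameBits + " bits, " + e.numExceptions + " exception(s)");
    }
}
```

A single pass that builds a histogram of required bit widths would give the same counts with one scan, which is one way the indexing overhead could be reduced.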
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287952#comment-13287952 ] 

Michael McCandless commented on LUCENE-3892:
--------------------------------------------

bq. Hmm, that means I should remove TestMin.java? This testcase works fine for the patch.

Oh it's fine to keep TestMin now that you wrote it ... I was just saying that TestDemo is the test I run when I want the most trivial test for a new big change.

{quote}
I'm not very familiar with these markers. Shall I change all the 
 "TODO" markers into "nocommit"? Are the markers related to documentation, 
 or are they just reminders not to commit the current code?
{quote}

Sorry - this is just a convention I use: I put a // nocommit comment whenever there's a "blocker" to committing; this way I can grep for nocommit to see what still needs fixing... and towards the end, nocommits will often be downgraded to TODOs since on closer inspection they really don't have to block committing...
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch


[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3892:
------------------------------

    Attachment: LUCENE-3892_for.patch
                LUCENE-3892_pfor.patch

The new "3892_pfor" patch fixes some "SuppressingCodec" issues from the last two weeks, and the "3892_for" patch is a quick (lazy) implementation of the "For" postings format based on the current code. The two patches are kept separate for now, to avoid the performance cost of method overriding.

So far, block sizes from 32 to 128 have been tested with both patches. However, for skipping-intensive queries there is no significant performance gain from a smaller block size.

Here is an earlier result for PFor with blockSize=64, compared with blockSize=128 (in brackets):
{noformat}
                Task    QPS Base StdDev Base    QPS PFor StdDev PFor      Pct diff
              Phrase        4.93        0.36        3.10        0.33  -47% -  -25%  (-47% -  -25%)
          AndHighMed       27.92        2.26       19.16        1.72  -42% -  -18%  (-37% -  -15%)
            SpanNear        2.73        0.16        1.96        0.24  -40% -  -14%  (-36% -  -13%)
        SloppyPhrase        4.19        0.21        3.20        0.30  -34% -  -12%  (-30% -   -6%)
            Wildcard       19.44        0.87       17.11        0.94  -20% -   -2%  (-17% -    3%)
         AndHighHigh        7.50        0.38        6.61        0.59  -23% -    1%  (-19% -    6%)
              IntNRQ        4.06        0.52        3.88        0.35  -22% -   19%  (-16% -   24%)
             Prefix3       31.00        1.69       30.45        2.29  -13% -   11%  ( -6% -   20%)
          OrHighHigh        4.16        0.47        4.11        0.34  -18% -   20%  (-14% -   27%)
           OrHighMed        4.98        0.59        4.94        0.41  -18% -   22%  (-14% -   27%)
             Respell       40.29        2.11       40.11        2.13  -10% -   10%  (-15% -    2%)
        TermBGroup1M       20.50        0.32       20.52        0.80   -5% -    5%  (  1% -   10%)
         TermGroup1M       13.51        0.43       13.61        0.40   -5% -    7%  (  1% -    9%)
              Fuzzy1       43.20        1.83       44.02        1.95   -6% -   11%  (-11% -    1%)
            PKLookup       87.16        1.78       89.52        0.94    0% -    5%  ( -2% -    7%)
              Fuzzy2       16.09        0.80       16.54        0.77   -6% -   13%  (-11% -    6%)
                Term       43.56        1.53       45.26        3.84   -8% -   16%  (  2% -   26%)
      TermBGroup1M1P       21.33        0.64       22.24        1.23   -4% -   13%  (  0% -   14%) 
{noformat}

Also, the For postings format shows little performance change, so I suppose the bottleneck is not in PForUtil.patchException.
Here is an example with blockSize=64:
{noformat}
                Task    QPS Base StdDev Base     QPS For  StdDev For      Pct diff
              Phrase        5.03        0.45        3.30        0.43  -47% -  -18%
          AndHighMed       28.05        2.33       18.83        1.77  -43% -  -19%
            SpanNear        2.69        0.18        1.94        0.25  -40% -  -12%
        SloppyPhrase        4.19        0.20        3.22        0.35  -34% -  -10%
         AndHighHigh        7.61        0.46        6.41        0.54  -27% -   -2%
             Respell       41.36        1.65       37.94        2.42  -17% -    1%
            Wildcard       19.20        0.77       17.89        0.99  -15% -    2%
          OrHighHigh        4.22        0.37        3.94        0.32  -21% -   10%
           OrHighMed        5.06        0.46        4.73        0.39  -21% -   11%
              Fuzzy1       44.15        1.31       42.38        1.74  -10% -    2%
              Fuzzy2       16.48        0.59       15.84        0.76  -11% -    4%
         TermGroup1M       13.32        0.35       13.44        0.53   -5% -    7%
            PKLookup       87.70        1.81       88.62        1.22   -2% -    4%
        TermBGroup1M       20.14        0.47       20.40        0.59   -3% -    6%
             Prefix3       30.31        1.49       31.08        2.26   -9% -   15%
      TermBGroup1M1P       21.13        0.46       21.79        1.42   -5% -   12%
              IntNRQ        3.96        0.45        4.14        0.46  -16% -   31%
                Term       43.07        1.51       46.06        4.50   -6% -   21%
{noformat}
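
The Pct diff column is not explained in this thread. One reading that reproduces the listed ranges to within about a percentage point (on rows such as Phrase and AndHighMed) is a worst case and a best case built from mean QPS plus or minus one standard deviation. The sketch below encodes that assumption; the class and method names are hypothetical, and the benchmark tool's exact rounding is not reproduced.

```java
final class PctDiffSketch {
    /**
     * Returns {worst, best} percentage change of the candidate relative to the baseline,
     * assuming each side may vary by one standard deviation.
     */
    static double[] pctDiffRange(double qpsBase, double sdBase, double qpsCmp, double sdCmp) {
        // Worst case: candidate at its low end versus baseline at its high end.
        double worst = 100.0 * ((qpsCmp - sdCmp) / (qpsBase + sdBase) - 1.0);
        // Best case: candidate at its high end versus baseline at its low end.
        double best = 100.0 * ((qpsCmp + sdCmp) / (qpsBase - sdBase) - 1.0);
        return new double[] {worst, best};
    }

    public static void main(String[] args) {
        // Phrase row of the PFor table: base 4.93 +/- 0.36, PFor 3.10 +/- 0.33.
        double[] r = pctDiffRange(4.93, 0.36, 3.10, 0.33);
        System.out.printf("%.1f%% to %.1f%%%n", r[0], r[1]);
    }
}
```

Under this reading, a range that spans zero (for example Term or IntNRQ) means the difference is within the run-to-run noise.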
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262835#comment-13262835 ] 

Robert Muir commented on LUCENE-3892:
-------------------------------------

I will commit this patch: please let us know if you have more problems from China! :)
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3892_settings.patch, LUCENE-3892_settings.patch


[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3892:
------------------------------

    Attachment:     (was: LUCENE-3892_for.patch)
    
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262711#comment-13262711 ] 

Robert Muir commented on LUCENE-3892:
-------------------------------------

Maybe a good solution is to have an ant property (that we somehow pass to ivy), and
conditionally default it in ant to a server that we know works in China,
if "${user.language}" = "zh"?
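
The selection logic proposed here can be sketched in a few lines, assuming an explicit override always wins and the language check only supplies the default. The URLs are placeholders and the property name ivy.repo.override is hypothetical; the actual change would live in the ant/ivy configuration rather than in Java.

```java
final class MirrorSelectionSketch {
    // Placeholder URLs, not real mirrors.
    static final String DEFAULT_REPO = "https://repo.example.org/maven2";
    static final String CHINA_REPO = "https://mirror.example.cn/maven2";

    /** Picks the repository: an explicit override first, otherwise a default based on the language. */
    static String chooseRepo(String userLanguage, String override) {
        if (override != null && !override.isEmpty()) {
            return override;
        }
        return "zh".equals(userLanguage) ? CHINA_REPO : DEFAULT_REPO;
    }

    public static void main(String[] args) {
        String lang = System.getProperty("user.language");         // standard JVM property
        String override = System.getProperty("ivy.repo.override"); // hypothetical property
        System.out.println(chooseRepo(lang, override));
    }
}
```

Keeping the override separate from the language default means a user whose locale does not match their network location can still pick a working server.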
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0


[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3892:
------------------------------

    Attachment: LUCENE-3892_for_unfold_method.patch
                LUCENE-3892_pfor_unfold_method.patch

The *unfold_method.patch files just remove the nested call to PForDecompressImpl.decode, and also strip out the numBytes information for ForPF. 
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for.patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch


[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3892:
--------------------------------

    Attachment: LUCENE-3892_settings.patch

Can you remove your hack and try this patch?
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3892_settings.patch


[jira] [Issue Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247175#comment-13247175 ] 

Han Jiang edited comment on LUCENE-3892 at 4/5/12 12:20 PM:
------------------------------------------------------------

Hi Mike,
I have changed my proposal a bit, but I still have some questions:

{quote}
* There are actually more than 2 codecs (eg we also have Lucene3x,
SimpleText, sep/intblock (abstract), random codecs/postings
formats for testing...), but our default codec now is Lucene40.
{quote}

Yes, but it seems that our baseline will be Lucene40 and Pulsing? Lucene3x is read-only, and the other codecs are not meant for production use.
Also, what is the random codec? Does it randomly pick a codec for the user?

{quote}
* I think you can use the existing abstract sep/intblock classes
(ie, they implement layers like FieldsProducer/Consumer...), and
then you can "just" implement the required methods (eg to
encode/decode one int[] block).
{quote}

And this was my initial thought about the PForDelta interface:

The class hierarchy will be as below (quite similar to pulsing):
* PForDeltaPostingsFormat(extends PostingsFormat): 
   	It will define global behaviors such as file suffix, and provide customized FieldsWriter/Reader
* PForDeltaFieldsWriter(extends FieldsConsumer): 
    	It will define how terms,docids,freq,offset are written into posting files.
    	inner classes include: 
** PForDeltaTermsConsumer(extends TermsConsumer)
** PForDeltaPostingsConsumer(extends PostingsConsumer)
* PForDeltaFieldsReader(extends FieldsProducer):
    	It will define how postings are read from index, and provide *Enum class to iterate docids, freqs etc.
    	inner classes include:
** PForDeltaFieldsEnum(extends FieldsEnum)
** PForDeltaTermsEnum(extends TermsEnum)
** PForDeltaDocsEnum(extends DocsEnum)
** PForDeltaDocsAndPositionsEnum(extends DocsAndPositionsEnum)
** PForDeltaTerms(extends Terms)

It seems that "BlockTermsReader/Writer" have already implemented those subclasses, and we can just pass our Postings(Writer/Reader)Base as an argument, like PatchedFrameOfRefCodec::fieldsConsumer() does.
Then, to introduce PForDeltaCodec into trunk, should we also introduce the "fixed codec"? Also, why isn't lucene40codec implemented in this way? 
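
To make the quoted suggestion concrete, here is a rough sketch of what "encode/decode one int[] block" could look like for a plain fixed-width (FOR-style) scheme without exceptions. The class and method names are hypothetical; this is not the sep/intblock API and not the code in the attached patches.

```java
final class FixedBitBlockSketch {
    /** Smallest bit width that can hold every value in the block. */
    static int bitsRequired(int[] block) {
        int or = 0;
        for (int v : block) {
            or |= v;
        }
        return 32 - Integer.numberOfLeadingZeros(or);
    }

    /** Packs each value into exactly `bits` bits, lowest bits first. Values must fit in `bits`. */
    static long[] encode(int[] block, int bits) {
        long[] packed = new long[(block.length * bits + 63) / 64];
        if (bits == 0) {
            return packed; // all values are zero, nothing to store
        }
        long bitPos = 0;
        for (int v : block) {
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long value = v & 0xFFFFFFFFL;
            packed[word] |= value << shift;
            if (shift + bits > 64) {
                // The value straddles two longs; spill the high part into the next one.
                packed[word + 1] |= value >>> (64 - shift);
            }
            bitPos += bits;
        }
        return packed;
    }

    /** Reverses encode(): reads `count` values of `bits` bits each. */
    static int[] decode(long[] packed, int bits, int count) {
        int[] out = new int[count];
        if (bits == 0) {
            return out;
        }
        long mask = -1L >>> (64 - bits);
        long bitPos = 0;
        for (int i = 0; i < count; i++) {
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long value = packed[word] >>> shift;
            if (shift + bits > 64) {
                value |= packed[word + 1] << (64 - shift);
            }
            out[i] = (int) (value & mask);
            bitPos += bits;
        }
        return out;
    }
}
```

A PFor variant would choose a smaller width than bitsRequired returns and store the values that do not fit separately as exceptions.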

{quote}
* We may need to tune the skipper settings, based on profiling
results from skip-intensive (Phrase, And) queries... since it's
currently geared towards single-doc-at-once encoding. I don't think
we should try to make a new skipper impl here... (there is a separate
issue for that).
{quote}

It seems that the skip settings are not closely tied to the backend codec? Do you mean the nocommit line at FixedPostingsWriterImpl.java:117?

{quote}
* Maybe explore the combination of pulsing and PForDelta codecs;
seems like the combination of those two could be important, since
for low docFreq terms, retrieving the docs is now more
expensive...
{quote}

Yes, it seems that if PForDelta outperforms current approaches, a Pulsing version will work better. This feature will also come as "phase 2".
                
      was (Author: billy):
    Hi Mike,
I have changed my proposal a bit, but I still have some questions:

{quote}
* There are actually more than 2 codecs (eg we also have Lucene3x,
SimpleText, sep/intblock (abstract), random codecs/postings
formats for testing...), but our default codec now is Lucene40.
{quote}

Yes, but it seems that our baseline will be Lucene40 and Pulsing? Lucene3x is read-only, and the other codecs are not meant for production use.
Also, what is the random codec? Does it randomly pick a codec for the user?

{quote}
* I think you can use the existing abstract sep/intblock classes
(ie, they implement layers like FieldsProducer/Consumer...), and
then you can "just" implement the required methods (eg to
encode/decode one int[] block).
{quote}

And this was my initial thought about the PForDelta interface:

The class hierarchy will be as below (quite similar to pulsing):
* PForDeltaPostingsFormat(extends PostingsFormat): 
   	It will define global behaviors such as file suffix, and provide customized FieldsWriter/Reader
* PForDeltaFieldsWriter(extends FieldsConsumer): 
    	It will define how terms, docids, freqs, and offsets are written into postings files.
    	inner classes include: 
** PForDeltaTermsConsumer(extends TermsConsumer)
** PForDeltaPostingsConsumer(extends PostingsConsumer)
* PForDeltaFieldsReader(extends FieldsProducer):
    	It will define how postings are read from the index, and provide *Enum classes to iterate over docids, freqs, etc.
    	inner classes include:
** PForDeltaFieldsEnum(extends FieldsEnum)
** PForDeltaTermsEnum(extends TermsEnum)
** PForDeltaDocsEnum(extends DocsEnum)
** PForDeltaDocsAndPositionsEnum(extends DocsAndPositionsEnum)
** PForDeltaTerms(extends Terms)

It seems that "BlockTermsReader/Writer" already implement those subclasses, and we can just pass our Postings(Writer/Reader)Base as an argument, as PatchedFrameOfRefCodec::fieldsConsumer() does.
Then, to introduce PForDeltaCodec into trunk, should we also introduce the "fixed codec"? Also, why isn't the Lucene40 codec implemented along these lines?

{quote}
* We may need to tune the skipper settings, based on profiling
results from skip-intensive (Phrase, And) queries... since it's
currently geared towards single-doc-at-once encoding. I don't think
we should try to make a new skipper impl here... (there is a separate
issue for that).
{quote}

I haven't investigated the different kinds of queries much yet. What are the skipper settings?

{quote}
* Maybe explore the combination of pulsing and PForDelta codecs;
seems like the combination of those two could be important, since
for low docFreq terms, retrieving the docs is now more
expensive...
{quote}

Yes, it seems that if PForDelta outperforms current approaches, a Pulsing version will work better. This feature will also come as "phase 2".
                  
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260149#comment-13260149 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

Thank you all for giving me this opportunity! Let's begin!
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398800#comment-13398800 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

And the same code with the wikimediumhard.tasks file. (This is really a hard test case: the QPS values are so small that we can hardly rely on Pct diff :) )
{noformat}
                Task    QPS Base StdDev Base     QPS For  StdDev For      Pct diff
          AndHighMed       10.76        0.21        6.47        0.32  -43% -  -35%
         AndHighHigh        2.89        0.08        2.57        0.19  -20% -   -1%
            SpanNear        0.60        0.01        0.55        0.01  -11% -   -6%
        SloppyPhrase        0.61        0.01        0.57        0.01   -9% -   -3%
            PKLookup       87.72        2.61       86.28        1.48   -6% -    3%
              Fuzzy1       36.22        1.14       35.90        0.97   -6% -    5%
              Phrase        1.22        0.03        1.22        0.08   -9% -    8%
             Respell       32.84        0.92       33.55        0.87   -3% -    7%
              IntNRQ        3.66        0.35        3.74        0.08   -8% -   15%
              Fuzzy2       21.62        0.66       22.10        0.51   -3% -    7%
             Prefix3       13.30        0.49       14.09        0.76   -3% -   15%
           OrHighMed        3.43        0.16        3.65        0.45  -10% -   25%
          OrHighHigh        1.66        0.09        1.79        0.22  -10% -   28%
            Wildcard        3.39        0.14        3.74        0.20    0% -   21%
      TermBGroup1M1P        1.84        0.03        2.10        0.16    3% -   25%
         TermGroup1M        1.14        0.03        1.34        0.10    5% -   29%
        TermBGroup1M        1.49        0.05        1.78        0.13    7% -   32%
                Term        3.49        0.13        4.38        0.65    2% -   49%
{noformat}
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262694#comment-13262694 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

Strangely, I sometimes cannot access repo1.maven.org, so "ant ivy-bootstrap" and "ant resolve" fail. (Since I'm in China, the network connection might be restricted.)

Mike and I first tried to make things work by pointing "lucene/common-build.xml" and "dev-tools/scripts/poll-mirrors.pl" at another Maven mirror, listed at http://docs.codehaus.org/display/MAVENUSER/Mirrors+Repositories. Unfortunately, the main site "repo1.maven.org" is configured inside ivy-2.2.0.jar, so even when "ant ivy-bootstrap" passes, "ant resolve" still fails.

Well, here is how I got things to work (too ugly; I hope someone has a better suggestion!):

Change /etc/hosts to redirect the current Maven site to a mirror with the same directory structure, for example:

194.8.197.22    repo1.maven.org # to http://mirror.netcologne.de/
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265950#comment-13265950 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

A postings format named VSEncoding also seems promising! 

It is available here: http://integerencoding.isti.cnr.it/

And its license is compatible: https://github.com/maropu/integer_encoding_library/blob/master/LICENSE
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398507#comment-13398507 ] 

Han Jiang edited comment on LUCENE-3892 at 6/21/12 3:53 PM:
------------------------------------------------------------

For the decompression phase, this patch replaces the use of IntBuffer with a direct int[]-to-int[] decoder. The convert() method should be fast enough, because it is no different from the internal implementation of IntBuffer.get() (see http://massapi.com/source/jdk1.6.0_17/src/java/nio/Bits.java.html, line 193). However, the results aren't interesting.

Hmm, there is an extra block of memory writes here, which Mike wanted to avoid in the previous patch. That is likely the cause.
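
A minimal sketch of the two conversion paths being compared here, for readers following along. It is not the code in the attached patch: the class IntBlockConvert and both method names are hypothetical, and big-endian byte order (the ByteBuffer default) is assumed. The manual path fills a whole int[] before any decoding starts, which is the extra memory write mentioned above.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.util.Arrays;

// Hypothetical helper for illustration; not taken from the LUCENE-3892 patches.
final class IntBlockConvert {

  // Path 1: view the bytes through an IntBuffer and bulk-copy into an int[].
  static int[] viaIntBuffer(byte[] bytes) {
    IntBuffer view = ByteBuffer.wrap(bytes).asIntBuffer(); // big-endian by default
    int[] out = new int[view.remaining()];
    view.get(out);
    return out;
  }

  // Path 2: assemble each int by hand from four big-endian bytes.
  // Every int is written to 'out' before any decoding can begin.
  static int[] viaManualConvert(byte[] bytes) {
    int[] out = new int[bytes.length / 4];
    for (int i = 0, j = 0; i < out.length; i++, j += 4) {
      out[i] = ((bytes[j] & 0xFF) << 24)
             | ((bytes[j + 1] & 0xFF) << 16)
             | ((bytes[j + 2] & 0xFF) << 8)
             | (bytes[j + 3] & 0xFF);
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] raw = {0, 0, 1, 2, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFE};
    System.out.println(Arrays.toString(viaIntBuffer(raw)));
    System.out.println(Arrays.toString(viaManualConvert(raw)));
  }
}
```

Both paths yield the same int[]; they differ only in where the copy happens, so any speed difference has to come from the copy itself rather than from the decoder.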
                
      was (Author: billy):
    For the decompression phase, this patch replaces the use of IntBuffer with a direct int[]-to-int[] decoder. The convert() method should be fast enough, because it is no different from the internal implementation of IntBuffer.get() (see http://massapi.com/source/jdk1.6.0_17/src/java/nio/Bits.java.html, line 193). However, the results aren't interesting.
                  
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for.patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397694#comment-13397694 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

OK, I just reproduced your test. But Mike, are we using the same task file? Our relative speeds for different queries are not the same.
{quote}
                Task    QPS Base StdDev Base     QPS For  StdDev For      Pct diff
              Phrase        5.07        0.45        3.76        0.19  -35% -  -14% (-44% -  -18%)
          AndHighMed       28.32        2.34       22.67        0.67  -28% -  -10% (-38% -   -9%)
            SpanNear        2.72        0.13        2.36        0.14  -22% -   -3% (-36% -   -8%)
        SloppyPhrase        4.18        0.20        3.83        0.15  -16% -    0% (-33% -   -6%)
             Respell       42.02        1.83       38.86        2.30  -16% -    2% (-18% -    0%)
              Fuzzy1       44.96        1.58       42.85        1.69  -11% -    2% (-12% -    0%)
              Fuzzy2       16.78        0.69       16.34        0.68  -10% -    5% (-12% -    3%)
            PKLookup       89.11        2.15       87.33        2.19   -6% -    2% ( -2% -    5%)
         AndHighHigh        7.61        0.44        7.69        0.21   -7% -   10% (-21% -   10%)
            Wildcard       19.50        0.91       20.02        0.72   -5% -   11% (-21% -    3%)
        TermBGroup1M       20.82        0.37       21.73        0.69    0% -    9% (  2% -   10%)
         TermGroup1M       13.79        0.13       14.61        0.32    2% -    9% (  1% -    9%)
              IntNRQ        4.11        0.56        4.56        0.56  -14% -   43% (-25% -   33%)
      TermBGroup1M1P       21.45        0.75       24.00        0.51    5% -   18% ( -1% -   22%)
           OrHighMed        5.08        0.49        5.73        0.15    0% -   28% (-16% -   25%)
          OrHighHigh        4.22        0.39        4.78        0.13    1% -   28% (-15% -   24%)
             Prefix3       30.91        1.63       35.65        2.02    3% -   28% (-14% -   21%)
                Term       44.36        1.87       54.01        1.96   12% -   31% ( -1% -   33%)
{quote}
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for.patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3892:
--------------------------------

    Attachment: LUCENE-3892_settings.patch

Updated patch that also adds logic for ivy-bootstrap: if repo1.maven.org fails, we try the same China-friendly mirror (currently http://mirror.netcologne.de/maven2). We disable fail-on-error and instead SHA-1-checksum the result at the end to determine whether the download really succeeded (if the check fails, the build prints a message suggesting you download the file manually).
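
The control flow described above can be sketched in Java as follows. The real change lives in the Ant build files, so this is only an illustration: the class BootstrapSketch, its methods, and the use of Supplier to stand in for "download from this URL" are all hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration of "tolerate download errors, verify by checksum".
final class BootstrapSketch {

  // Hex-encoded SHA-1 digest of the given bytes.
  static String sha1Hex(byte[] data) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      StringBuilder hex = new StringBuilder();
      for (byte b : md.digest(data)) {
        hex.append(String.format("%02x", b & 0xFF));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-1 not available", e);
    }
  }

  // Try each source in order (primary first, then mirrors). A failing source
  // is tolerated, like fail-on-error=false; success is decided only by the
  // checksum of whatever was fetched in the end.
  static byte[] fetchVerified(List<Supplier<byte[]>> sources, String expectedSha1) {
    byte[] data = null;
    for (Supplier<byte[]> source : sources) {
      try {
        data = source.get();
      } catch (RuntimeException e) {
        data = null; // ignore and fall through to the next source
      }
      if (data != null) {
        break;
      }
    }
    if (data == null || !sha1Hex(data).equals(expectedSha1)) {
      throw new IllegalStateException("bootstrap failed; please download the file manually");
    }
    return data;
  }
}
```

Checking the digest once at the end means a truncated or corrupted download is caught the same way as an unreachable host.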
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Chris Male (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399869#comment-13399869 ] 

Chris Male commented on LUCENE-3892:
------------------------------------

The effect of peeling back those abstractions is really interesting.
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3892:
---------------------------------------

    Attachment: LUCENE-3892-direct-IntBuffer.patch

The For index is 5.2 GB vs 4.9 GB for vInt: not bad to have only 5%
increase in index size when using For PF (10M wikipedia index).

{quote}
Get more direct access to the file as an int[]; eg MMapDir could
expose an IntBuffer from its ByteBuffer (saving the initial copy
into byte[] that we now do). 
{quote}

I tested this, by making hacked up changes to Billy's For patch
requiring MMapDirectory and pulling an IntBuffer directly from its
ByteBuffer, saving one copy of bytes into the byte[] first.  But,
curiously, it didn't seem to improve things much:

{noformat}
                Task    QPS base StdDev base     QPS for  StdDev for      Pct diff
          AndHighMed       24.32        0.60       14.24        0.41  -44% -  -38%
            PKLookup      131.98        3.09      108.35        1.47  -20% -  -14%
         AndHighHigh        5.36        0.18        4.66        0.02  -16% -   -9%
              Phrase        1.48        0.02        1.33        0.10  -18% -   -2%
        SloppyPhrase        1.40        0.04        1.26        0.03  -13% -   -5%
            SpanNear        1.14        0.01        1.04        0.02  -10% -   -6%
              IntNRQ       12.13        0.70       11.27        0.46  -15% -    2%
             Prefix3       34.51        1.17       34.11        1.28   -8% -    6%
              Fuzzy1       90.63        1.74       89.68        1.46   -4% -    2%
             Respell       77.22        2.62       76.99        1.62   -5% -    5%
            Wildcard       11.84        0.40       12.20        0.37   -3% -    9%
              Fuzzy2       34.34        0.82       36.16        1.08    0% -   11%
      TermBGroup1M1P        4.71        0.11        5.02        0.18    0% -   12%
           OrHighMed        7.87        0.28        8.50        0.55   -2% -   19%
        TermBGroup1M        3.47        0.03        3.78        0.03    7% -   11%
         TermGroup1M        2.96        0.01        3.25        0.03    8% -   11%
          OrHighHigh        3.55        0.12        3.91        0.21    0% -   20%
                Term        9.72        0.28       10.87        0.44    4% -   19%
{noformat}

Maybe, instead, reading into an int[] and decoding from an int array
(hopefully avoiding bounds checks) will be faster than calling
IntBuffer.get for each encoded int...
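
To make the idea concrete, here is a minimal sketch of decoding a bit-packed block straight from an int[]. It is not the code from Billy's patch: the class ForSketch, its methods, and the bit layout (values packed low-bit-first across consecutive ints) are assumptions for illustration, and a tuned decoder might specialize the loop per bit width instead of using this generic one.

```java
// Hypothetical sketch of frame-of-reference style bit packing.
// Assumed layout: values are packed low-bit-first into consecutive ints.
final class ForSketch {

  // Pack values into an int[], using numBits (1..32) bits per value.
  // Each value is assumed to fit in numBits bits.
  static int[] pack(int[] values, int numBits) {
    int[] packed = new int[(values.length * numBits + 31) >>> 5];
    long bitPos = 0;
    for (int v : values) {
      int idx = (int) (bitPos >>> 5);
      int shift = (int) (bitPos & 31);
      packed[idx] |= v << shift;
      if (shift + numBits > 32) {              // value straddles two ints
        packed[idx + 1] |= v >>> (32 - shift);
      }
      bitPos += numBits;
    }
    return packed;
  }

  // Decode out.length values of numBits (1..32) bits each from packed.
  static void unpack(int[] packed, int numBits, int[] out) {
    int mask = numBits == 32 ? -1 : (1 << numBits) - 1;
    long bitPos = 0;
    for (int i = 0; i < out.length; i++) {
      int idx = (int) (bitPos >>> 5);
      int shift = (int) (bitPos & 31);
      int v = packed[idx] >>> shift;
      if (shift + numBits > 32) {              // pull the high bits from the next int
        v |= packed[idx + 1] << (32 - shift);
      }
      out[i] = v & mask;
      bitPos += numBits;
    }
  }
}
```

Here the decoder only touches plain arrays, so the per-value cost is a couple of shifts and a mask rather than a method call on a buffer.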

                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289104#comment-13289104 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

Thanks Mike, that gives us a lot of detail to help optimize!

bq. Still missing a couple license headers (TestMin, TestCompress)...
OK, I'll add them later.

bq. I ran a quick perf test using http://code.google.com/a/apache-extras.org/p/luceneutil on a 10M doc Wikipedia index.
The script is wonderful! But the wiki data seems to be missing? Can I get it from a wiki dump instead?

bq. Indexing time is ~18% slower than Lucene40PostingsFormat (1071 sec vs 1261 sec).
Yes, that is expected: the encoder currently scans every block 33 times to estimate metadata such as numFrameBits and numExceptions.
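
For illustration, a brute-force estimate of that kind might look like the sketch below: one full scan of the block per candidate bit width (0 through 32, hence 33 scans). This is not the patch's code; the class PForEstimate, the method name, and the cost model (numFrameBits bits per value plus 32 bits per exception) are assumptions.

```java
// Hypothetical sketch of brute-force PFor parameter estimation.
final class PForEstimate {

  // Returns {numFrameBits, numExceptions} minimizing an assumed cost of
  // numFrameBits bits per value plus 32 bits per exception.
  static int[] estimate(int[] block) {
    int bestBits = 0;
    int bestExceptions = 0;
    long bestCost = Long.MAX_VALUE;
    for (int bits = 0; bits <= 32; bits++) {   // 33 candidate widths
      long limit = 1L << bits;                 // values >= limit do not fit
      int exceptions = 0;
      for (int v : block) {                    // one full scan per candidate
        if ((v & 0xFFFFFFFFL) >= limit) {      // compare as unsigned
          exceptions++;
        }
      }
      long cost = (long) bits * block.length + 32L * exceptions;
      if (cost < bestCost) {
        bestCost = cost;
        bestBits = bits;
        bestExceptions = exceptions;
      }
    }
    return new int[] {bestBits, bestExceptions};
  }
}
```

With 33 passes over each block, the estimation work grows linearly with the number of candidate widths, which fits the slower indexing time reported above.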
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396987#comment-13396987 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

Oh, thank you Mike! I haven't thought too much about those skipping policies.

bq. Up above, in ForFactory, when we readInt() to get numBytes ... it seems like we could stuff the header numBits into that same int and save checking that in FORUtil.decompress....
Ah, yes, I just forgot to remove the redundant code. Here is an initial attempt that removes the header and calls ForDecompressImpl directly in readBlock(), with For, blockSize=128. Figures in parentheses show the prior benchmark.
{noformat}
                Task    QPS Base StdDev Base     QPS For  StdDev For      Pct diff
              Phrase        4.99        0.37        3.57        0.26  -38% -  -17% (-44% -  -18%)
          AndHighMed       28.91        2.17       22.66        0.82  -29% -  -12% (-38% -   -9%)
            SpanNear        2.72        0.14        2.22        0.13  -26% -   -8% (-36% -   -8%)
        SloppyPhrase        4.24        0.26        3.70        0.16  -21% -   -3% (-33% -   -6%)
             Respell       40.71        2.59       37.66        1.36  -16% -    2% (-18% -    0%)
              Fuzzy1       43.22        2.01       40.66        0.32  -10% -    0% (-12% -    0%)
              Fuzzy2       16.25        0.90       15.64        0.26  -10% -    3% (-12% -    3%)
            Wildcard       19.07        0.86       19.07        0.73   -8% -    8% (-21% -    3%)
         AndHighHigh        7.76        0.47        7.77        0.15   -7% -    8% (-21% -   10%)
            PKLookup       87.50        4.56       88.51        1.24   -5% -    8% ( -2% -    5%)
        TermBGroup1M       20.42        0.87       21.32        0.74   -3% -   12% (  2% -   10%)
           OrHighMed        5.33        0.68        5.61        0.14   -9% -   23% (-16% -   25%)
          OrHighHigh        4.43        0.53        4.69        0.12   -8% -   23% (-15% -   24%)
         TermGroup1M       13.30        0.34       14.31        0.40    2% -   13% (  0% -   13%)
      TermBGroup1M1P       20.92        0.59       23.71        0.86    6% -   20% ( -1% -   22%)
             Prefix3       30.30        1.41       35.14        1.76    5% -   27% (-14% -   21%)
              IntNRQ        3.90        0.54        4.58        0.47   -7% -   50% (-25% -   33%)
                Term       42.17        1.55       52.33        2.57   13% -   35% (  1% -   33%)
{noformat}
The improvement is quite general. However, I still suspect it comes mostly from fewer method calls. I'm trying to change the PFor code to remove those nested calls as well.
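
The suggestion quoted above (stuffing numBits into the same int that already carries numBytes) could be sketched as follows. The bit layout is an assumption made for illustration: the low 6 bits hold numBits (0..32) and the remaining 26 bits hold numBytes. The class and method names are hypothetical and are not taken from the patch.

```java
/** Hypothetical single-int block header: numBytes in the high 26 bits, numBits in the low 6. */
final class BlockHeader {
    private static final int BITS_FIELD = 6;                    // wide enough for 0..32
    private static final int BITS_MASK = (1 << BITS_FIELD) - 1;
    private static final int MAX_BYTES = (1 << (32 - BITS_FIELD)) - 1;

    static int encode(int numBytes, int numBits) {
        if (numBits < 0 || numBits > 32) {
            throw new IllegalArgumentException("numBits out of range: " + numBits);
        }
        if (numBytes < 0 || numBytes > MAX_BYTES) {
            throw new IllegalArgumentException("numBytes out of range: " + numBytes);
        }
        return (numBytes << BITS_FIELD) | numBits;
    }

    static int numBytes(int header) {
        return header >>> BITS_FIELD;    // unsigned shift: the top bit may be set
    }

    static int numBits(int header) {
        return header & BITS_MASK;
    }

    public static void main(String[] args) {
        int header = encode(512, 7);
        System.out.println(numBytes(header) + " bytes, " + numBits(header) + " bits per value");
    }
}
```

With such a header the reader does one readInt() per block and can dispatch straight to the decoder for that width, without a second header check inside the decompress call.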

bq. Get more direct access to the file as an int[]; ...
Ok, this will be considered when the pfor+pulsing is completed. I'm just curious why we don't have readInts in ora.util yet...

bq. Skipping: can we partially decode a block? ...
The pfor-opt approach (encode the lower bits of each exception in the normal area, and the remaining bits in the exception area) naturally fits "partially decode a block"; that will become possible when we optimize skipping queries.
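
A rough sketch of the split described above, as I read it: every value contributes its low numBits to the normal area, and only values that overflow also store their high bits in a separate exception area. Because the normal area contains no links between exceptions, a single slot can be decoded without walking the rest of the block. The arrays are left unpacked for clarity, and all names are hypothetical rather than taken from the patch.

```java
import java.util.Arrays;

/** Hypothetical illustration of the pfor-opt low/high split (no bit packing). */
final class PForOptSketch {
    final int numBits;
    final int[] low;      // normal area: low numBits of every value
    final int[] excPos;   // ascending positions of the exceptions
    final int[] excHigh;  // remaining high bits (value >>> numBits) of each exception

    PForOptSketch(int[] values, int numBits) {
        if (numBits < 1 || numBits > 31) {
            throw new IllegalArgumentException("numBits must be in 1..31");
        }
        this.numBits = numBits;
        int mask = (1 << numBits) - 1;
        low = new int[values.length];
        int[] pos = new int[values.length];
        int[] high = new int[values.length];
        int n = 0;
        for (int i = 0; i < values.length; i++) {
            low[i] = values[i] & mask;
            int h = values[i] >>> numBits;
            if (h != 0) {             // value does not fit: record an exception
                pos[n] = i;
                high[n] = h;
                n++;
            }
        }
        excPos = Arrays.copyOf(pos, n);
        excHigh = Arrays.copyOf(high, n);
    }

    /** Full decode: copy the normal area, then patch the exceptions. */
    int[] decodeAll() {
        int[] out = low.clone();
        for (int k = 0; k < excPos.length; k++) {
            out[excPos[k]] |= excHigh[k] << numBits;
        }
        return out;
    }

    /** Partial decode of one slot, patching only if that slot is an exception. */
    int decodeOne(int i) {
        int k = Arrays.binarySearch(excPos, i);
        return k >= 0 ? (low[i] | (excHigh[k] << numBits)) : low[i];
    }
}
```

A real implementation would bit-pack both areas; the point here is only that the normal area is self-describing, which is what makes decoding part of a block plausible.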
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240527#comment-13240527 ] 

Michael McCandless commented on LUCENE-3892:
--------------------------------------------

That's great Han, I'll have a look.

I can be a mentor for this...
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240403#comment-13240403 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

Hi, I have submitted my proposal. Comments are welcome!
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395894#comment-13395894 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

There's a potential bottleneck in the method calls. Here is an example for PFor, with blocksize=128, exception rate = 97%, normal value <= 2 bits, exception value <= 32 bits:

{noformat}
Decoding normal values:                              4703 ns
Patching exceptions:                                 5797 ns
Single call of PForUtil.decompress totally takes:   58318 ns
{noformat}

In addition, it costs about 4000ns to record the time span.
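
For measurements this small the timer itself matters. A minimal sketch, assuming the spans are taken with System.nanoTime() (the comment does not say which timer was used): estimate the cost of an empty pair of timer calls and subtract it from each measured span. The class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper for correcting very short timing measurements. */
final class TimerOverhead {

    /** Median cost in nanoseconds of two back-to-back System.nanoTime() calls. */
    static long estimateOverheadNanos(int samples) {
        long[] deltas = new long[samples];
        for (int i = 0; i < samples; i++) {
            long t0 = System.nanoTime();
            long t1 = System.nanoTime();
            deltas[i] = t1 - t0;
        }
        Arrays.sort(deltas);
        return deltas[samples / 2];      // the median is less sensitive to GC or scheduler spikes
    }

    /** Measured span minus timer overhead, never negative. */
    static long correctedNanos(long measuredNanos, long overheadNanos) {
        return Math.max(0L, measuredNanos - overheadNanos);
    }

    public static void main(String[] args) {
        long overhead = estimateOverheadNanos(10_000);
        System.out.println("timer overhead ~" + overhead + " ns");
        System.out.println("corrected total: " + correctedNanos(58318L, 4000L) + " ns");
    }
}
```

With the figures quoted above, subtracting a 4000 ns recording cost from the 58318 ns total leaves 54318 ns, still well above the 4703 ns + 5797 ns spent in the two decoding phases.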
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247175#comment-13247175 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

{quote}
* There are actually more than 2 codecs (eg we also have Lucene3x,
SimpleText, sep/intblock (abstract), random codecs/postings
formats for testing...), but our default codec now is Lucene40.
{quote}

Yes, but it seems that our baseline will be Lucene40 and Pulsing? Lucene3x is read-only, and the other approaches are not meant for production use.
Also, what is the random codec? Does it randomly pick a codec for the user?

{quote}
* I think you can use the existing abstract sep/intblock classes
(ie, they implement layers like FieldsProducer/Consumer...), and
then you can "just" implement the required methods (eg to
encode/decode one int[] block).
{quote}

And this was my initial thought about the PForDelta interface:

The class hierarchy will be as below (quite similar to pulsing):
* PForDeltaPostingsFormat (extends PostingsFormat):
    It will define global behaviors such as the file suffix, and provide customized FieldsWriter/Reader.
* PForDeltaFieldsWriter (extends FieldsConsumer):
    It will define how terms, docids, freqs and offsets are written into posting files.
    Inner classes include:
** PForDeltaTermsConsumer (extends TermsConsumer)
** PForDeltaPostingsConsumer (extends PostingsConsumer)
* PForDeltaFieldsReader (extends FieldsProducer):
    It will define how postings are read from the index, and provide *Enum classes to iterate docids, freqs etc.
    Inner classes include:
** PForDeltaFieldsEnum (extends FieldsEnum)
** PForDeltaTermsEnum (extends TermsEnum)
** PForDeltaDocsEnum (extends DocsEnum)
** PForDeltaDocsAndPositionsEnum (extends DocsAndPositionsEnum)
** PForDeltaTerms (extends Terms)

It seems that "BlockTermsReader/Writer" have already implemented those subclasses, so we can just pass our Postings(Writer/Reader)Base as an argument, as PatchedFrameOfRefCodec::fieldsConsumer() does.
Then, to introduce PForDeltaCodec into trunk, should we also introduce the "fixed codec"? Also, why isn't lucene40codec implemented along these lines?

{quote}
* We may need to tune the skipper settings, based on profiling
results from skip-intensive (Phrase, And) queries... since it's
currently geared towards single-doc-at-once encoding. I don't think
we should try to make a new skipper impl here... (there is a separate
issue for that).
{quote}

I haven't investigated much about different kinds of queries. What are skipper settings? 

{quote}
* Maybe explore the combination of pulsing and PForDelta codecs;
seems like the combination of those two could be important, since
for low docFreq terms, retrieving the docs is now more
expensive...
{quote}

Yes, it seems that if PForDelta outperforms current approaches, a Pulsing version will work better? This feature will also come as "phase 2".

                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287951#comment-13287951 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

Ah, yes, I forgot to use -Dtests.postingsformat...I can see the errors
now.

{quote}
TestDemo makes a nice TestMin... I usually start with TestDemo when
testing scary new code, and then it's a huge milestone once TestDemo
passes 
{quote}
Hmm, that means I should remove TestMin.java? This testcase works fine
for the patch.

{quote}
We should definitely cutover to BlockTree terms dict (I would upgrade
that TODO to a nocommit!).
{quote}
I'm not very familiar with these markers. Should I change every
"TODO" into "nocommit"? Are the markers related to documentation,
or are they just reminders not to commit the current code?
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291850#comment-13291850 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

OK, here is a result I tried to reproduce with Mike's test script:
Indexing time:
    trunk: 2396 sec
    patch: 2793 sec

Searching time:
{noformat}
                Task QPS Lucene40 StdDev Lucene40    QPS PFor StdDev PFor      Pct diff
          AndHighMed       22.76        0.54       14.68        1.00  -41% -  -29%
        SloppyPhrase        3.58        0.17        2.46        0.27  -41% -  -19%
            SpanNear        5.90        0.09        4.08        0.37  -38% -  -23%
         AndHighHigh       10.00        0.17        8.08        0.57  -26% -  -11%
              Phrase        1.68        0.07        1.45        0.17  -27% -    0%
             Respell       37.65        0.74       33.41        1.04  -15% -   -6%
              Fuzzy1       38.00        1.60       34.37        1.06  -15% -   -2%
              IntNRQ        4.27        0.33        3.87        0.19  -19% -    3%
              Fuzzy2       16.35        0.60       15.02        0.31  -13% -   -2%
            Wildcard       30.24        0.57       28.24        1.85  -14% -    1%
            PKLookup       85.82        5.04       83.25        2.81  -11% -    6%
             Prefix3       19.20        0.40       19.19        1.46   -9% -    9%
           OrHighMed        9.25        0.59        9.41        0.70  -11% -   16%
         TermGroup1M       11.46        0.62       11.74        0.81   -9% -   15%
          OrHighHigh        3.15        0.17        3.28        0.23   -8% -   17%
      TermBGroup1M1P       19.28        0.38       20.32        1.14   -2% -   13%
        TermBGroup1M        6.23        0.21        6.71        0.46   -3% -   19%
                Term       30.86        1.52       34.34        3.26   -4% -   28%
{noformat}

This was run on a 64-bit AMD server with Java 1.7.0.
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288675#comment-13288675 ] 

Michael McCandless commented on LUCENE-3892:
--------------------------------------------

Excellent!  All tests also pass for me w/ PFor postings format as
well... this is a great starting point :) One Solr test failed
(ContentStreamTest)... but I think it was false failure...

I did notice the tests seem to run slower, especially certain ones eg
TestJoinUtil.

Still missing a couple license headers (TestMin, TestCompress)...

I ran a quick perf test using
http://code.google.com/a/apache-extras.org/p/luceneutil on a 10M doc
Wikipedia index.

Indexing time is ~18% slower than Lucene40PostingsFormat (1071 sec vs
1261 sec).

But more important is the slower search times:

{noformat}
                Task    QPS base StdDev base    QPS pfor StdDev pfor      Pct diff
              Phrase        8.52        0.50        4.43        0.40  -55% -  -39%
        SloppyPhrase       12.52        0.39        7.87        0.51  -43% -  -30%
          AndHighMed       67.69        2.82       44.22        1.47  -39% -  -29%
            SpanNear        5.19        0.12        3.90        0.28  -31% -  -17%
            PKLookup      112.16        1.71       95.61        1.30  -17% -  -12%
         AndHighHigh       13.22        0.34       11.86        0.72  -17% -   -2%
            Wildcard       46.04        0.37       41.68        4.45  -19% -    1%
              Fuzzy1       50.11        2.03       48.06        1.91  -11% -    3%
           OrHighMed        9.26        0.48        8.90        0.37  -12% -    5%
          OrHighHigh       12.28        0.56       11.83        0.49  -11% -    5%
      TermBGroup1M1P       40.47        1.94       39.88        2.51  -11% -   10%
              Fuzzy2       53.71        2.66       53.01        2.08   -9% -    7%
         TermGroup1M       36.46        1.21       35.99        1.58   -8% -    6%
        TermBGroup1M       55.53        1.99       55.26        2.68   -8% -    8%
             Respell       69.71        4.49       69.73        2.07   -8% -   10%
                Term       94.38        7.62       94.96       12.19  -18% -   23%
             Prefix3       41.63        0.34       42.21        5.82  -13% -   16%
              IntNRQ        7.08        0.15        7.28        1.29  -17% -   23%
{noformat}
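
A note on reading the "Pct diff" column: the ranges in the table are consistent with comparing the two runs one standard deviation apart in each direction and truncating the percentage toward zero. This is inferred from the numbers shown here, not taken from the luceneutil source, and the class and method names below are hypothetical.

```java
/** Hypothetical reconstruction of the "Pct diff" range from QPS and StdDev columns. */
final class PctDiffRange {

    /** Returns {low, high} percent change of cmp relative to base, truncated toward zero. */
    static int[] range(double qpsBase, double sdBase, double qpsCmp, double sdCmp) {
        // Pessimistic end: slowest plausible cmp against fastest plausible base.
        double low = (qpsCmp - sdCmp) / (qpsBase + sdBase) - 1.0;
        // Optimistic end: fastest plausible cmp against slowest plausible base.
        double high = (qpsCmp + sdCmp) / (qpsBase - sdBase) - 1.0;
        return new int[] {(int) (100.0 * low), (int) (100.0 * high)};
    }

    public static void main(String[] args) {
        int[] phrase = range(8.52, 0.50, 4.43, 0.40);   // the Phrase row of the table
        System.out.println(phrase[0] + "% - " + phrase[1] + "%");
    }
}
```

Read this way, a row is a clear regression only when both ends of the range are negative, as for Phrase and AndHighMed; rows whose range straddles zero are within the noise.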

The queries that do skipping are quite a bit slower; this makes sense,
since on skip we do a full block decode.  A smaller block size (we use
128 now right?) should help I think.

It's strange that the non-skipping queries (Term, OrHighMed,
OrHighHigh) don't show any performance gain ... maybe we need to
optimize the decode... or it could be the removal of the bulk api
is hurting us here.

I'm also curious if we tried a pure FOR (no patching, so we must set
numBits according to the max value = larger index but hopefully faster
decode) if the results would improve...
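
A minimal sketch of the pure FOR idea described above: numBits is chosen from the largest value in the block, and every value is packed at that width with no exceptions to patch. This is only an illustration under stated assumptions (values are non-negative, packing goes into a long[]); the class and method names are hypothetical and this is not the decoder from the patch.

```java
/** Hypothetical pure frame-of-reference packing for one int[] block. */
final class ForBlockSketch {

    /** Width that fits every value: bits needed by the OR of all values. */
    static int numBitsFor(int[] block) {
        int or = 0;
        for (int v : block) {
            or |= v;
        }
        return 32 - Integer.numberOfLeadingZeros(or);
    }

    static long[] pack(int[] block, int numBits) {
        long[] out = new long[(block.length * numBits + 63) / 64];
        if (numBits == 0) {
            return out;                              // all zeros: nothing to store
        }
        long bitPos = 0;
        for (int v : block) {
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long val = v & 0xFFFFFFFFL;
            out[word] |= val << shift;
            if (shift + numBits > 64) {              // value straddles two longs
                out[word + 1] |= val >>> (64 - shift);
            }
            bitPos += numBits;
        }
        return out;
    }

    static int[] unpack(long[] packed, int numBits, int count) {
        int[] out = new int[count];
        if (numBits == 0) {
            return out;
        }
        long mask = -1L >>> (64 - numBits);
        long bitPos = 0;
        for (int i = 0; i < count; i++) {
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long val = packed[word] >>> shift;
            if (shift + numBits > 64) {
                val |= packed[word + 1] << (64 - shift);
            }
            out[i] = (int) (val & mask);
            bitPos += numBits;
        }
        return out;
    }
}
```

The tradeoff mentioned above is visible in numBitsFor: a single large value widens all 128 slots, so the index grows, but the decode loop has no exception handling at all.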


                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397228#comment-13397228 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

And here is the result for PFor (blocksize=128):
{noformat}
                Task    QPS Base StdDev Base    QPS PFor StdDev PFor      Pct diff
              Phrase        4.87        0.36        3.39        0.18  -38% -  -20% (-47% -  -25%)
          AndHighMed       27.78        2.35       21.13        0.52  -31% -  -14% (-37% -  -15%)
            SpanNear        2.70        0.14        2.20        0.11  -26% -   -9% (-36% -  -13%)
        SloppyPhrase        4.17        0.15        3.77        0.21  -17% -    0% (-30% -   -6%)
             Respell       39.97        1.56       37.65        1.95  -14% -    3% (-15% -    2%)
            Wildcard       19.08        0.77       18.33        0.92  -12% -    5% (-17% -    3%)
              Fuzzy1       42.29        1.13       40.78        1.44   -9% -    2% (-11% -    1%)
         AndHighHigh        7.61        0.55        7.45        0.08   -9% -    6% (-19% -    6%)
              Fuzzy2       15.79        0.55       15.64        0.70   -8% -    7% (-11% -    6%)
            PKLookup       86.71        2.13       88.92        2.24   -2% -    7% ( -2% -    7%)
         TermGroup1M       13.04        0.23       14.03        0.40    2% -   12% (  1% -    9%)
              IntNRQ        3.97        0.48        4.35        0.61  -15% -   41% (-16% -   24%)
      TermBGroup1M1P       21.04        0.35       23.20        0.60    5% -   14% (  0% -   14%)
        TermBGroup1M       19.27        0.47       21.28        0.84    3% -   17% (  1% -   10%)
          OrHighHigh        4.13        0.47        4.63        0.27   -5% -   34% (-14% -   27%)
           OrHighMed        4.95        0.59        5.58        0.34   -5% -   35% (-14% -   27%)
             Prefix3       30.33        1.36       34.26        2.14    1% -   25% ( -6% -   20%)
                Term       41.99        1.19       50.75        1.72   13% -   28% (  2% -   26%)
{noformat}
It works, and it is quite interesting that StdDev for Term query is reduced significantly.  
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.



[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3892:
------------------------------

    Attachment: LUCENE-3892_pfor.patch

Ah, just cannot wait for a performance optimization!

This version should now pass all tests below: 

ant test-core -Dtests.postingsformat=PFor

It fixes the following: 1) trailing forced exceptions are now ignored and encoded as normal values; 2) the IntBuffer is maintained at the IndexInput/Output level; 3) former nocommit issues such as BlockTreeTerms* and the code licence headers.

The patch also contains a minimal change with the help of Robert's patch: https://issues.apache.org/jira/secure/attachment/12530685/LUCENE-4102.patch. Hope Dawid will commit the complete version into trunk soon!

I'll try to optimize this code later.
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262772#comment-13262772 ] 

Han Jiang commented on LUCENE-3892:
-----------------------------------

Thank you Robert! The patch works well. 
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3892_settings.patch
>
>


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397605#comment-13397605 ] 

Michael McCandless commented on LUCENE-3892:
--------------------------------------------

OK, I created a branch and committed the last For patch: https://svn.apache.org/repos/asf/lucene/dev/branches/pforcodec_3892
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for.patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262777#comment-13262777 ] 

Robert Muir commented on LUCENE-3892:
-------------------------------------

Patch does not yet fix ivy-bootstrap. Ivy-bootstrap still only tries repo1.maven.org. We need a different strategy for that: either we depend on try-catch from ant contrib (undesired), use custom ant task (grrrr), or use a chain of targets with fail-on-error=false unless the file already exists and checksum at the end... Lemme see if i can fix ivy-bootstrap, too!
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>         Attachments: LUCENE-3892_settings.patch
>
>


[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3892:
------------------------------

    Attachment: LUCENE-3892_for_byte[].patch

I have now removed the memory-write code and replaced IntBuffer.get() with getInt(byte,byte,byte,byte). Since this patch contains method unfolding, there is no actual difference... It seems that we were paying attention to the wrong point.
{noformat}
                Task    QPS Base StdDev Base     QPS For  StdDev For      Pct diff
              Phrase        5.02        0.46        3.66        0.30  -38% -  -13%
          AndHighMed       28.08        2.29       23.04        1.01  -27% -   -6%
            SpanNear        2.69        0.16        2.30        0.19  -25% -    0%
        SloppyPhrase        4.18        0.22        3.83        0.18  -16% -    1%
             Respell       41.92        2.15       39.54        2.45  -15% -    5%
              Fuzzy1       44.47        1.99       43.34        3.07  -13% -    9%
            Wildcard       19.70        1.06       19.60        1.16  -11% -   11%
              Fuzzy2       16.54        0.86       16.52        1.16  -11% -   12%
            PKLookup       87.32        2.47       88.62        1.33   -2% -    6%
         AndHighHigh        7.55        0.43        7.84        0.15   -3% -   12%
        TermBGroup1M       19.86        0.14       21.41        0.70    3% -   12%
         TermGroup1M       13.35        0.17       14.40        0.38    3% -   12%
              IntNRQ        4.10        0.57        4.45        0.73  -20% -   46%
      TermBGroup1M1P       21.29        0.63       23.45        0.82    3% -   17%
             Prefix3       31.13        1.71       35.53        2.90    0% -   30%
           OrHighMed        4.96        0.61        5.83        0.35   -1% -   42%
          OrHighHigh        4.13        0.49        4.87        0.29    0% -   41%
                Term       42.93        1.17       52.11        2.21   13% -   30%
{noformat}
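
As a rough illustration of the two read paths compared above, the sketch below assembles an int from four bytes by hand and reads the same value through an IntBuffer view. The getInt(byte,byte,byte,byte) signature comes from the comment; the big-endian byte order and the class and helper names (GetIntSketch, viaIntBuffer) are assumptions made for this example, not details taken from the patch.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

final class GetIntSketch {
  // Assemble one int from four bytes, most significant byte first
  // (big-endian is an assumption; it matches ByteBuffer's default order).
  static int getInt(byte b0, byte b1, byte b2, byte b3) {
    return ((b0 & 0xFF) << 24) | ((b1 & 0xFF) << 16) | ((b2 & 0xFF) << 8) | (b3 & 0xFF);
  }

  // Read the int at the given int index through an IntBuffer view of the same bytes.
  static int viaIntBuffer(byte[] bytes, int intIndex) {
    IntBuffer ints = ByteBuffer.wrap(bytes).asIntBuffer();
    return ints.get(intIndex);
  }

  public static void main(String[] args) {
    byte[] raw = {0x12, 0x34, 0x56, 0x78, (byte) 0xFF, (byte) 0xFE, (byte) 0xFD, (byte) 0xFC};
    for (int i = 0; i < raw.length / 4; i++) {
      int manual = getInt(raw[4 * i], raw[4 * i + 1], raw[4 * i + 2], raw[4 * i + 3]);
      System.out.println(Integer.toHexString(manual) + " / "
          + Integer.toHexString(viaIntBuffer(raw, i)));
    }
  }
}
```

Both paths produce the same values, so any speed difference would have to come from how the JIT treats each call, which is consistent with the benchmark showing no real change.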
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397958#comment-13397958 ] 

Michael McCandless commented on LUCENE-3892:
--------------------------------------------

bq.  But Mike, are we using the same task file? Our relative speeds for different queries are not the same.

Sorry, I'm using a hand edited "hard" tasks file; I'll commit & push to luceneutil.  But, separately: each run picks a different subset of the tasks from each category to run, so results from one run to another in general aren't comparable unless we fix the random seed it uses.
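
To illustrate why a fixed seed makes runs comparable, here is a minimal sketch of picking a random subset of tasks from one category. It is not luceneutil's actual code; the class and method names (TaskSampleSketch, pickTasks) and the sample task strings are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class TaskSampleSketch {
  // Pick up to `count` tasks from one category. The same seed always
  // yields the same subset, so two runs can be compared directly.
  static List<String> pickTasks(List<String> category, int count, long seed) {
    List<String> shuffled = new ArrayList<>(category);
    Collections.shuffle(shuffled, new Random(seed));
    return new ArrayList<>(shuffled.subList(0, Math.min(count, shuffled.size())));
  }

  public static void main(String[] args) {
    List<String> phraseTasks = Arrays.asList("\"a b\"", "\"c d\"", "\"e f\"", "\"g h\"", "\"i j\"");
    System.out.println(pickTasks(phraseTasks, 2, 17L));
    System.out.println(pickTasks(phraseTasks, 2, 17L)); // same seed, same subset
  }
}
```

Without a shared seed, each run may measure a different mix of easy and hard queries within a category, which would explain differing relative speeds.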
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for.patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262701#comment-13262701 ] 

Michael McCandless commented on LUCENE-3892:
--------------------------------------------

Phew, I'm glad to hear you got it working!  So "ant resolve" finished successfully?
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
>
>


[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3892:
------------------------------

    Attachment: LUCENE-3892_for.patch
                LUCENE-3892_pfor.patch
    
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>


[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3892:
------------------------------

    Attachment: LUCENE-3892_pfor.patch

Here is an initial implementation of PForPostingsFormat. It is registered in oal.codecs.mockrandom.MockRandomPostingsFormat, and all tests pass (maybe I should modify some other mock files as well?).

This version was originally inspired by the pfor and pfor2 impls in bulk_branch, mostly by the idea of pfor. Currently, the compressed data consists of three parts: header, normal area, and exception area. The normal area encodes each small value, as well as each exception value, as b bits. The exception area stores each large value directly, possibly as 8, 16, or 32 bits. All numFrameBits values from 1 to 32 are supported.

I haven't tested the performance yet, but there are some known bottlenecks. For example, with data = {0, 0xffffffff, 0, 1, 0, 1, 0} and numFrameBits=1, the '1's that follow will be forced to be exceptions, which will dramatically increase the compressed size.
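
The forced-exception effect in that example can be illustrated with a small sketch. It assumes the classic PFOR layout, in which each exception's b-bit slot in the normal area stores the distance to the next exception minus one, so consecutive exceptions can be at most 2^b slots apart, and in which the header holds the position of the first exception. These are assumptions about the scheme, not code taken from the patch, and the names (PForExceptionSketch, exceptionPositions, chainToEnd) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class PForExceptionSketch {
  // Returns the indices that end up in the exception area.
  // chainToEnd=true keeps extending the exception chain to the end of the
  // block (forcing small values to become exceptions); false stops at the
  // last real exception.
  static List<Integer> exceptionPositions(int[] data, int numFrameBits, boolean chainToEnd) {
    List<Integer> positions = new ArrayList<>();
    if (numFrameBits >= 32) {
      return positions; // every int fits, so there are no exceptions
    }
    long maxGap = 1L << numFrameBits; // largest distance a b-bit slot can express
    int last = -1;                    // index of the previous exception, -1 if none
    for (int i = 0; i < data.length; i++) {
      if ((data[i] >>> numFrameBits) == 0) {
        continue; // value fits in numFrameBits bits
      }
      // Bridge gaps that a b-bit offset cannot express with forced exceptions.
      while (last >= 0 && i - last > maxGap) {
        last += (int) maxGap;
        positions.add(last);
      }
      positions.add(i);
      last = i;
    }
    if (chainToEnd && last >= 0) {
      while (last + maxGap < data.length) {
        last += (int) maxGap;
        positions.add(last);
      }
    }
    return positions;
  }

  public static void main(String[] args) {
    int[] data = {0, 0xffffffff, 0, 1, 0, 1, 0};
    System.out.println(exceptionPositions(data, 1, true));
    System.out.println(exceptionPositions(data, 1, false));
  }
}
```

Under these assumptions, chaining to the end of the block turns indices 3 and 5 (the two '1's) into forced exceptions alongside the real one at index 1, while stopping at the last real exception leaves only index 1 in the exception area.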
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>


[jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3892:
------------------------------

    Attachment:     (was: LUCENE-3892_pfor.patch)
    
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>


[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397694#comment-13397694 ] 

Han Jiang edited comment on LUCENE-3892 at 6/20/12 5:57 PM:
------------------------------------------------------------

OK, I just reproduced your test. But Mike, are we using the same task file? Our relative speeds for different queries are not the same. 
{noformat}
                Task    QPS Base StdDev Base     QPS For  StdDev For      Pct diff
              Phrase        5.07        0.45        3.76        0.19  -35% -  -14% (-44% -  -18%)
          AndHighMed       28.32        2.34       22.67        0.67  -28% -  -10% (-38% -   -9%)
            SpanNear        2.72        0.13        2.36        0.14  -22% -   -3% (-36% -   -8%)
        SloppyPhrase        4.18        0.20        3.83        0.15  -16% -    0% (-33% -   -6%)
             Respell       42.02        1.83       38.86        2.30  -16% -    2% (-18% -    0%)
              Fuzzy1       44.96        1.58       42.85        1.69  -11% -    2% (-12% -    0%)
              Fuzzy2       16.78        0.69       16.34        0.68  -10% -    5% (-12% -    3%)
            PKLookup       89.11        2.15       87.33        2.19   -6% -    2% ( -2% -    5%)
         AndHighHigh        7.61        0.44        7.69        0.21   -7% -   10% (-21% -   10%)
            Wildcard       19.50        0.91       20.02        0.72   -5% -   11% (-21% -    3%)
        TermBGroup1M       20.82        0.37       21.73        0.69    0% -    9% (  2% -   10%)
         TermGroup1M       13.79        0.13       14.61        0.32    2% -    9% (  1% -    9%)
              IntNRQ        4.11        0.56        4.56        0.56  -14% -   43% (-25% -   33%)
      TermBGroup1M1P       21.45        0.75       24.00        0.51    5% -   18% ( -1% -   22%)
           OrHighMed        5.08        0.49        5.73        0.15    0% -   28% (-16% -   25%)
          OrHighHigh        4.22        0.39        4.78        0.13    1% -   28% (-15% -   24%)
             Prefix3       30.91        1.63       35.65        2.02    3% -   28% (-14% -   21%)
                Term       44.36        1.87       54.01        1.96   12% -   31% ( -1% -   33%)
{noformat}
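
For readers of the table, the first "Pct diff" range in each row appears to be a worst-case to best-case bound built from the QPS and StdDev columns. The sketch below reproduces those figures under that assumption; the formula is inferred from the numbers in the table rather than taken from the benchmark tool's source, and the names (PctDiffSketch, pctDiffRange) are hypothetical.

```java
final class PctDiffSketch {
  // Worst case: candidate one StdDev slower against baseline one StdDev faster.
  // Best case: the reverse. Percentages are truncated toward zero.
  static int[] pctDiffRange(double qpsBase, double stdBase, double qpsCmp, double stdCmp) {
    double worst = (qpsCmp - stdCmp) / (qpsBase + stdBase) - 1.0;
    double best = (qpsCmp + stdCmp) / (qpsBase - stdBase) - 1.0;
    return new int[] {(int) (100.0 * worst), (int) (100.0 * best)};
  }

  public static void main(String[] args) {
    // Phrase row: QPS Base 5.07, StdDev Base 0.45, QPS For 3.76, StdDev For 0.19
    int[] phrase = pctDiffRange(5.07, 0.45, 3.76, 0.19);
    System.out.println(phrase[0] + "% - " + phrase[1] + "%");
  }
}
```

A range that straddles zero, as in the Fuzzy2 or PKLookup rows, therefore means the difference is within the measured noise.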
                
                  
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for.patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>


[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Posted by "Han Jiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398582#comment-13398582 ] 

Han Jiang edited comment on LUCENE-3892 at 6/21/12 5:01 PM:
------------------------------------------------------------

I have now removed the memory-write code and replaced IntBuffer.get() with getInt(byte,byte,byte,byte). Since this patch contains method unfolding, there is no actual difference... It seems that we were paying attention to the wrong point.
{noformat}
                Task    QPS Base StdDev Base     QPS For  StdDev For      Pct diff
              Phrase        5.02        0.46        3.66        0.30  -38% -  -13% (-38% -  -17%)
          AndHighMed       28.08        2.29       23.04        1.01  -27% -   -6% (-29% -  -12%)
            SpanNear        2.69        0.16        2.30        0.19  -25% -    0% (-26% -   -8%)
        SloppyPhrase        4.18        0.22        3.83        0.18  -16% -    1% (-21% -   -3%)
             Respell       41.92        2.15       39.54        2.45  -15% -    5% (-16% -    2%)
              Fuzzy1       44.47        1.99       43.34        3.07  -13% -    9% (-10% -    0%)
            Wildcard       19.70        1.06       19.60        1.16  -11% -   11% ( -8% -    8%)
              Fuzzy2       16.54        0.86       16.52        1.16  -11% -   12% (-10% -    3%)
            PKLookup       87.32        2.47       88.62        1.33   -2% -    6% ( -5% -    8%)
         AndHighHigh        7.55        0.43        7.84        0.15   -3% -   12% ( -7% -    8%)
        TermBGroup1M       19.86        0.14       21.41        0.70    3% -   12% ( -3% -   12%)
         TermGroup1M       13.35        0.17       14.40        0.38    3% -   12% (  2% -   13%)
              IntNRQ        4.10        0.57        4.45        0.73  -20% -   46% ( -7% -   50%)
      TermBGroup1M1P       21.29        0.63       23.45        0.82    3% -   17% (  6% -   20%)
             Prefix3       31.13        1.71       35.53        2.90    0% -   30% (  5% -   27%)
           OrHighMed        4.96        0.61        5.83        0.35   -1% -   42% ( -9% -   23%)
          OrHighHigh        4.13        0.49        4.87        0.29    0% -   41% ( -8% -   23%)
                Term       42.93        1.17       52.11        2.21   13% -   30% ( 13% -   35%)
{noformat}
This is compared with the result in https://issues.apache.org/jira/browse/LUCENE-3892?focusedCommentId=13396987&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13396987
                
      was (Author: billy):
    Now remove the memory write codes, and replace IntBuffer.get() with getInt(byte,byte,byte,byte), since this patch contains method unfolding, there is no actually difference...Seems that we're paying attenting on a wrong point.
{noformat}
                Task    QPS Base StdDev Base     QPS For  StdDev For      Pct diff
              Phrase        5.02        0.46        3.66        0.30  -38% -  -13%
          AndHighMed       28.08        2.29       23.04        1.01  -27% -   -6%
            SpanNear        2.69        0.16        2.30        0.19  -25% -    0%
        SloppyPhrase        4.18        0.22        3.83        0.18  -16% -    1%
             Respell       41.92        2.15       39.54        2.45  -15% -    5%
              Fuzzy1       44.47        1.99       43.34        3.07  -13% -    9%
            Wildcard       19.70        1.06       19.60        1.16  -11% -   11%
              Fuzzy2       16.54        0.86       16.52        1.16  -11% -   12%
            PKLookup       87.32        2.47       88.62        1.33   -2% -    6%
         AndHighHigh        7.55        0.43        7.84        0.15   -3% -   12%
        TermBGroup1M       19.86        0.14       21.41        0.70    3% -   12%
         TermGroup1M       13.35        0.17       14.40        0.38    3% -   12%
              IntNRQ        4.10        0.57        4.45        0.73  -20% -   46%
      TermBGroup1M1P       21.29        0.63       23.45        0.82    3% -   17%
             Prefix3       31.13        1.71       35.53        2.90    0% -   30%
           OrHighMed        4.96        0.61        5.83        0.35   -1% -   42%
          OrHighHigh        4.13        0.49        4.87        0.29    0% -   41%
                Term       42.93        1.17       52.11        2.21   13% -   30%
{noformat}
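
The comment above mentions replacing IntBuffer.get() with getInt(byte,byte,byte,byte). As a rough illustration of what that change amounts to, here is a minimal sketch that assembles an int from four bytes and checks it against an IntBuffer view of the same array. Only the getInt signature comes from the comment; the class name, the big-endian byte order and the helper viaIntBuffer are assumptions for illustration and are not taken from the attached patches.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

// Hypothetical helper class; not the code from the LUCENE-3892 patches.
final class IntAssembly {
    // Assemble an int from four bytes, most significant byte first (assumed big-endian).
    static int getInt(byte b0, byte b1, byte b2, byte b3) {
        return ((b0 & 0xFF) << 24) | ((b1 & 0xFF) << 16) | ((b2 & 0xFF) << 8) | (b3 & 0xFF);
    }

    // Read the i-th int through an IntBuffer view of the same bytes.
    // ByteBuffer.wrap() uses big-endian order by default, matching getInt above.
    static int viaIntBuffer(byte[] bytes, int i) {
        IntBuffer view = ByteBuffer.wrap(bytes).asIntBuffer();
        return view.get(i);
    }

    public static void main(String[] args) {
        byte[] bytes = {0x12, 0x34, 0x56, 0x78, (byte) 0xFF, (byte) 0xFE, 0x00, 0x01};
        for (int i = 0; i < bytes.length / 4; i++) {
            int o = i * 4;
            int direct = getInt(bytes[o], bytes[o + 1], bytes[o + 2], bytes[o + 3]);
            // Both paths should decode the same value.
            System.out.println(Integer.toHexString(direct) + " same=" + (direct == viaIntBuffer(bytes, i)));
        }
    }
}
```

Both paths decode the same values, which is consistent with the observation that the swap alone made no measurable difference.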
                  
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.
