You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Renaud Delbru (JIRA)" <ji...@apache.org> on 2011/01/24 19:32:44 UTC

[jira] Created: (LUCENE-2886) Adaptive Frame Of Reference

Adaptive Frame Of Reference 
----------------------------

                 Key: LUCENE-2886
                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Codecs
            Reporter: Renaud Delbru
             Fix For: 4.0


We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].

[1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [jira] Created: (LUCENE-2886) Adaptive Frame Of Reference

Posted by Renaud Delbru <re...@deri.org>.

-- sorry, resending it as I don't know what happens to the layout of the 
previous one

Hi Paul,

This is a good question. The two methods, i.e., VSE and AFOR, are very 
similar. The two methods can be considered as an extension of FOR to 
make it less sensitive to outliers by adapting the encoding to the value 
distribution. To achieve this, the two methods are encoding a list of 
values by
- partitioning it into "frames" (or sequence of consecutive integers) of 
variable lengths,
- encoding each frame using a different "bit frame" (the minimum number 
of bits required to encode any integer in the frame, and still be able 
to distinguish them)
- relying on algorithms to automatically find a good list partitioning.

Apart from the minor differences in the implementation design (that I 
will discuss later), the main difference is that VSE is optimised for 
achieving a high compression rate and a fast decompression but 
disregards the efficiency of compression, while AFOR is optimised for 
achieving a high compression rate, a fast decompression but also a fast 
compression speed. VSE is using a Dynamic Programming method to find the 
*optimal partitioning* of a list (optimal in term of compression rate). 
While this approach provides a higher compression rate than the one 
proposed in AFOR, the complexity of such a partitioning algorithm is O(n 
* k), with the term n being the number of values and the term k the size 
of the larger frame, which might greatly impact the compression 
performance. In AFOR, we use instead a local optimisation algorithm that 
is less effective in term of compression rate but faster to compute.

In term of implementation details, there is a few differences.
1) VSE allows frames of length 1, 2, 4, 6, 8, 12, 16 and 32. The current 
implementation of AFOR restrict the length of a frame to be a multiple 
of 8 to to be aligned with the start and end of a byte boundary (and 
also to minimise the number of loop-unrolled highly-optimised routines). 
More precisely, AFOR-2 use three frame lengths: 8, 16 and 32.
2) To allow the *optimal partitioning* of a list, the original 
implementation of VSE needs to operate on the full list. On the 
contrary, AFOR has been developed to operate on small subsets of the 
list, so that AFOR can be applied during incremental construction of the 
compressed list (it does not require the full list, but works on small 
block of 32 or more integers). However, we can think of applying VSE on 
small subset, as in AFOR. In this case, VSE does not compute the optimal 
partition of a list, but only the optimal partition of the subset of the 
list.

VSE and AFOR encodes a frame in a similar way: first, a header (1 byte) 
which provides the bit frame and the frames length, then the encoded frame.

So, as you can see, in essence, the two models are very similar. For the 
background, I know well Fabrizio Silvestri (co-author of VSE), and he 
was my PhD thesis examiner (the AFOR compression scheme is a chapter of 
my thesis). The funny thing is that we come up with these two models at 
the same time, this summer, without knowing we were working on something 
similar ;o). However, he was more lucky than I am to publish his 
findings before me.

I hope this answers to your question.
Feel free to ask if you have any other questions,
Regards,
-- 
Renaud Delbru

On 25/01/11 12:24, Renaud Delbru wrote:
> Hi Paul,
>
> This is a good question. The two methods, i.e., VSE and AFOR, are very 
> similar. The two methods can be considered as an extension of FOR to 
> make it less sensitive to outliers by adapting the encoding to the 
> value distribution. To achieve this, the two methods are encoding a 
> list of values by
> - partitioning it into "frames" (or sequence of consecutive integers) 
> of variable lengths,
> - encoding each frame using a different "bit frame" (the minimum 
> number of bits required to encode any integer in the frame, and still 
> be able to distinguish them)
> - relying on algorithms to automatically find a good list partitioning.
>
> Apart from the minor differences in the implementation design (that I 
> will discuss later), the main difference is that VSE is optimised for 
> achieving a high compression rate and a fast decompression but 
> disregards the efficiency of compression, while AFOR is optimised for 
> achieving a high compression rate, a fast decompression but also a 
> fast compression speed. VSE is using a Dynamic Programming method to 
> find the *optimal partitioning* of a list (optimal in term of 
> compression rate). While this approach provides a higher compression 
> rate than the one proposed in AFOR, the complexity of such a 
> partitioning algorithm is O(n * k), with the term n being the number 
> of values and the term k the size of the larger frame, which might 
> greatly impact the compression performance. In AFOR, we use instead a 
> local optimisation algorithm that is less effective in term of 
> compression rate but faster to compute.
>
> In term of implementation details, there is a few differences.
> 1) VSE allows frames of length 1, 2, 4, 6, 8, 12, 16 and 32. The 
> current implementation of AFOR restrict the length of a frame to be a 
> multiple of 8 to to be aligned with the start and end of a byte 
> boundary (and also to minimise the number of loop-unrolled 
> highly-optimised routines). More precisely, AFOR-2 use three frame 
> lengths: 8, 16 and 32.
> 2) To allow the *optimal partitioning* of a list, the original 
> implementation of VSE needs to operate on the full list. On the 
> contrary, AFOR has been developed to operate on small subsets of the 
> list, so that AFOR can be applied during incremental construction of 
> the compressed list (it does not require the full list, but works on 
> small block of 32 or more integers). However, we can think of applying 
> VSE on small subset, as in AFOR. In this case, VSE does not compute 
> the optimal partition of a list, but only the optimal partition of the 
> subset of the list.
>
> VSE and AFOR encodes a frame in a similar way: first, a header (1 
> byte) which provides the bit frame and the frames length, then the 
> encoded frame.
>
> So, as you can see, in essence, the two models are very similar. For 
> the background, I know well Fabrizio Silvestri (co-author of VSE), and 
> he was my PhD thesis examiner (the AFOR compression scheme is a 
> chapter of my thesis). The funny thing is that we come up with these 
> two models at the same time, this summer, without knowing we were 
> working on something similar ;o). However, he was more lucky than I am 
> to publish his findings before me.
>
> I hope this answers to your question.
> Feel free to ask if you have any other questions,
> Regards,
> -- 
> Renaud Delbru

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [jira] Created: (LUCENE-2886) Adaptive Frame Of Reference

Posted by Renaud Delbru <re...@deri.org>.

Hi Paul,

This is a good question. The two methods, i.e., VSE and AFOR, are very 
similar. The two methods can be considered as an extension of FOR to 
make it less sensitive to outliers by adapting the encoding to the value 
distribution. To achieve this, the two methods are encoding a list of 
values by
- partitioning it into "frames" (or sequence of consecutive integers) of 
variable lengths,
- encoding each frame using a different "bit frame" (the minimum number 
of bits required to encode any integer in the frame, and still be able 
to distinguish them)
- relying on algorithms to automatically find a good list partitioning.

Apart from the minor differences in the implementation design (that I 
will discuss later), the main difference is that VSE is optimised for 
achieving a high compression rate and a fast decompression but 
disregards the efficiency of compression, while AFOR is optimised for 
achieving a high compression rate, a fast decompression but also a fast 
compression speed. VSE is using a Dynamic Programming method to find the 
*optimal partitioning* of a list (optimal in term of compression rate). 
While this approach provides a higher compression rate than the one 
proposed in AFOR, the complexity of such a partitioning algorithm is O(n 
* k), with the term n being the number of values and the term k the size 
of the larger frame, which might greatly impact the compression 
performance. In AFOR, we use instead a local optimisation algorithm that 
is less effective in term of compression rate but faster to compute.

In term of implementation details, there is a few differences.
1) VSE allows frames of length 1, 2, 4, 6, 8, 12, 16 and 32. The current 
implementation of AFOR restrict the length of a frame to be a multiple 
of 8 to to be aligned with the start and end of a byte boundary (and 
also to minimise the number of loop-unrolled highly-optimised routines). 
More precisely, AFOR-2 use three frame lengths: 8, 16 and 32.
2) To allow the *optimal partitioning* of a list, the original 
implementation of VSE needs to operate on the full list. On the 
contrary, AFOR has been developed to operate on small subsets of the 
list, so that AFOR can be applied during incremental construction of the 
compressed list (it does not require the full list, but works on small 
block of 32 or more integers). However, we can think of applying VSE on 
small subset, as in AFOR. In this case, VSE does not compute the optimal 
partition of a list, but only the optimal partition of the subset of the 
list.

VSE and AFOR encodes a frame in a similar way: first, a header (1 byte) 
which provides the bit frame and the frames length, then the encoded frame.

So, as you can see, in essence, the two models are very similar. For the 
background, I know well Fabrizio Silvestri (co-author of VSE), and he 
was my PhD thesis examiner (the AFOR compression scheme is a chapter of 
my thesis). The funny thing is that we come up with these two models at 
the same time, this summer, without knowing we were working on something 
similar ;o). However, he was more lucky than I am to publish his 
findings before me.

I hope this answers to your question.
Feel free to ask if you have any other questions,
Regards,
-- 
Renaud Delbru

On 24/01/11 22:02, Paul Elschot wrote:
> Any idea on how this compares to the vector split encoding here:
> http://puma.isti.cnr.it/publichtml/section_cnr_isti/cnr_isti_2010-TR-016.html
> ?
>
> Regards,
> Paul Elschot
>
> On Monday 24 January 2011 19:32:44 Renaud Delbru (JIRA) wrote:
>> Adaptive Frame Of Reference
>> ----------------------------
>>
>>                   Key: LUCENE-2886
>>                   URL: https://issues.apache.org/jira/browse/LUCENE-2886
>>               Project: Lucene - Java
>>            Issue Type: New Feature
>>            Components: Codecs
>>              Reporter: Renaud Delbru
>>               Fix For: 4.0
>>
>>
>> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
>> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
>> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
>>
>> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf
>>
>> -- 
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [jira] Created: (LUCENE-2886) Adaptive Frame Of Reference

Posted by Paul Elschot <pa...@xs4all.nl>.

Any idea on how this compares to the vector split encoding here:
http://puma.isti.cnr.it/publichtml/section_cnr_isti/cnr_isti_2010-TR-016.html
?

Regards,
Paul Elschot

On Monday 24 January 2011 19:32:44 Renaud Delbru (JIRA) wrote:
> Adaptive Frame Of Reference 
> ----------------------------
> 
>                  Key: LUCENE-2886
>                  URL: https://issues.apache.org/jira/browse/LUCENE-2886
>              Project: Lucene - Java
>           Issue Type: New Feature
>           Components: Codecs
>             Reporter: Renaud Delbru
>              Fix For: 4.0
> 
> 
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> 
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990574#comment-12990574 ] 

Robert Muir commented on LUCENE-2886:
-------------------------------------

based on these results, I think a good experiment would be for
us to "group" Simple64, so that we arent reading blocks of just a single long value.

With NIOFS/SimpleFS, the reads are buffered so i think some of the overhead
is taken care of, but it doesn't tend to work so well with MMap.

So, if we 'group' the long values so we are e.g. reading say N long values
at once in a single internal 'block', I think we might get more efficiency
via the I/O system, and also less overhead from the bulkpostings apis.

I would hope this would speed up the other queries, and likely slow down
the +nebraska +states type thing, but I think thats ok.


> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Renaud Delbru (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Renaud Delbru updated LUCENE-2886:
----------------------------------

    Attachment: lucene-afor.tar.gz

tarball containing a maven project with source code and unit tests for:
- AFOR1
- AFOR2
- FOR
- PFOR Non Compulsive
- Simple64
- a basic tool for debugging IntBlock codecs.

It includes also the lucene-1458 snapshot dependencies that are necessary to compile the code and run the tests.

> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Renaud Delbru (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990611#comment-12990611 ] 

Renaud Delbru commented on LUCENE-2886:
---------------------------------------

{quote}
The BulkVInt codec is VInt implemented as a FixedIntBlock codec.
{quote}

Yes, I saw the code, it is a similar implementation of the VInt we used in our experiments.

{quote}
previously various codecs
looked much faster than Vint but a lot of the reason for this is due to the way Vint
was implemented...
{quote}

This is odd, because we observed the contrary (on the lucene-1458 branch). The standard codec was by an order of magnitude faster than any other codec. We discovered that this was due to the IntBlock interface implementation that:
- was copying the buffer bytearray two times (one time from the disk to the buffer, then another time from the buffer to the IntBlock codec).
- had to perform more work wrt to check each of the buffer (IntBlock buffer, IndexInput buffer).
But this might have been improved since then. Michael told me he worked on a new version of the IntBlock interface which was more performant.

{quote}
So, if we 'group' the long values so we are e.g. reading say N long values
at once in a single internal 'block', I think we might get more efficiency
via the I/O system, and also less overhead from the bulkpostings apis.
{quote}

If I understand, this is similar to increasing the boundaries of the variable block size. Indeed, it incurs some non-negligible overhead to perform a block read for each simple64 long word (simple64 frame), and this might be better to read more than one per block read.

> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Renaud Delbru (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990535#comment-12990535 ] 

Renaud Delbru commented on LUCENE-2886:
---------------------------------------

{quote}
In the case of 240 1's, i was surprised to see this selector was used over 2% of the time
for the gov collection's doc file?
{quote}
our results were performed on the wikipedia dataset and blogs dataset. I don;t know what was our selection rate, I was just referring to the gain in overall compression rate.

{quote}
But still, for the all 1's case I'm not actually thinking about unstructured text so much...
in this case I am thinking about metadata fields and more structured data?
{quote}

Yes, this makes sense. In the context of SIREn (kind of simple xml node based inverted index) which is meant for indexing semi-structured data, the difference was more observable (mainly on the frequency and position files, as well as other structure node files).
This might be also useful on the document id file for very common terms (maybe for certain type of facets, with a very few number of values covering a large portion of the document collection).

> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990439#comment-12990439 ] 

Robert Muir commented on LUCENE-2886:
-------------------------------------

We are still not using 2 of the simple64 selectors, but in the simple64 paper it discusses
using these wasted bits to represent 120 and 240 "all ones"... I think we should do this?


> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Paul Elschot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990634#comment-12990634 ] 

Paul Elschot commented on LUCENE-2886:
--------------------------------------

On the facility for allOnes in AFOR-3: one of the reasons why this appears to be of rather little use is that current analyzers do not index stop words. They do this for two reasons, index size and query time, both based on VByte. With an allOnes facility the first reason disappears almost completely, and query times can be also be guarded in other ways, for example by checking for document frequency and then trying to fall back on digrams.
So the message is: please keep this in.


> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Renaud Delbru (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990509#comment-12990509 ] 

Renaud Delbru edited comment on LUCENE-2886 at 2/4/11 10:42 AM:
----------------------------------------------------------------

Hi Michael, Robert,
great to hear that the code is useful, looking forward to see some benchmark.
I think the VarIntBlock approach is a good idea. Concerning the two unused "frame" codes, it will not cost too much to add them. This might be useful for the frequency inverted lists. However, I am not sure they will be used that much. In our experiments, we had a version of AFOR allowing frames of size 8, 16 and 32 integers with allOnes and allZeros. The gain was very minimal, in the order to 0.x% index size reduction, because these cases were occurring very rarely. But, this is still better than nothing. However, in the case of simple64, we are not talking about small frame (up to 32 integers), but frame of 120 to 240 integers. Therefore, I expect to see a drop of probability to encounter 120 or 240 consecutive ones. Maybe we can use them for more clever configurations such as
- inter-leaved sequences of 1 bit and 2 bits integers
- inter-leaved sequences of 2 bits and 3 bits integers
or something like this.
The best will be to do some tests to see which new configurations will make sense, like how many times a allOnes config is selected, or other configs, and choose which one to add. But this can be tedious task with only a limited benefit.

      was (Author: renaud.delbru):
    Hi Michael, Robert,
great to hear that the code is useful, looking forward to see some benchmark.
I think the VarIntBlock approach is a good idea. Concerning the two unused "frame" codes, it will not cost too much to add them. This might be useful for the frequency inverted lists. However, I am not sure they will be used that much. In our experiments, we had a version of AFOR allowing frames of size 8, 16 and 32 integers with allOnes and allZeros. The gain was very minimal, in the order to 0.x% index size reduction, because these cases were occurring very rarely. But, this is still better than nothing. However, in the case of simple64, we are not talking about small frame (up to 32 integers), but frame of 120 to 240 integers. Therefore, I expect to see a drop of probability to encounter 120 or 240 consecutive ones. Maybe we can use them for more clever configurations such as
- inter-leaved sequences of 1 bit and 2 bits integers
- inter-leaved sequences of 2 bits and 3 bits integers
or something like this.
The best will be to do some tests to see which new configurations will make sense, like how many times a allOnes config is selected, or other configs, and choose which one to add.
  
> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2886:
---------------------------------------

    Attachment: LUCENE-2886_simple64_varint.patch

New patch, including Robert's patch, and also adding Simple64 as a VarInt codec.  We badly need more testing of the VarInt cases, so it's great Simple64 came along (thanks Renaud!).

All tests pass w/ Simple64VarInt codec.

> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Renaud Delbru (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990568#comment-12990568 ] 

Renaud Delbru commented on LUCENE-2886:
---------------------------------------

Hi Michael, 
the first results are not that impressive. 
* Could you tell me what is BulkVInt ? Is it the simple VInt codec implemented on top of the Bulk branch ?
* What the difference between '+united +states' and '+nebraska +states' ? Is nebraska a low frequency term ?

> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2886:
--------------------------------

    Attachment: LUCENE-2886.patch

I spent some more time on Simple64, took Mike's previous patch and added some minor improvements:

# Switched the decoding logic to the "Simple-8-4b" referred to in the paper. This is the same encoding, but we process with ints instead of longs.
# Because our buffers are so tiny (for example 32 bytes), the overhead of NIO hurts rather than helps, so I switched to native arrays.

The performance is looking much more reasonable. Here's my tests on windows, maybe i can convince Mike to sanity-check it on linux.

64-bit SimpleFS
||Query||QPS BulkVInt||QPS Simple64VarInt4||Pct diff||||
|"united states"~3|3.79|3.67|{color:red}-3.3%{color}|
|doctitle:.*[Uu]nited.*|2.45|2.46|{color:green}0.3%{color}|
|spanNear([unit, state], 10, true)|22.13|22.64|{color:green}2.3%{color}|
|uni*|14.04|14.43|{color:green}2.7%{color}|
|united~0.75|6.83|7.04|{color:green}3.2%{color}|
|unit*|25.39|26.21|{color:green}3.2%{color}|
|doctimesecnum:[10000 TO 60000]|8.83|9.16|{color:green}3.6%{color}|
|united~0.6|4.29|4.47|{color:green}4.2%{color}|
|united states|9.35|9.74|{color:green}4.2%{color}|
|un*d|12.88|13.50|{color:green}4.8%{color}|
|"united states"|6.86|7.21|{color:green}5.1%{color}|
|unit~0.7|14.11|14.85|{color:green}5.3%{color}|
|unit~0.5|8.17|8.60|{color:green}5.3%{color}|
|u*d|5.70|6.05|{color:green}6.1%{color}|
|states|30.02|31.90|{color:green}6.3%{color}|
|spanFirst(unit, 5)|86.56|94.15|{color:green}8.8%{color}|
|+united +states|11.10|12.55|{color:green}13.1%{color}|
|+nebraska +states|46.72|57.90|{color:green}23.9%{color}|

32-bit SimpleFS
||Query||QPS BulkVInt||QPS Simple64VarInt4||Pct diff||||
|spanFirst(unit, 5)|95.67|91.02|{color:red}-4.9%{color}|
|"united states"|5.47|5.25|{color:red}-4.1%{color}|
|"united states"~3|3.37|3.32|{color:red}-1.6%{color}|
|unit*|20.45|20.33|{color:red}-0.6%{color}|
|uni*|11.10|11.06|{color:red}-0.3%{color}|
|doctimesecnum:[10000 TO 60000]|7.15|7.16|{color:green}0.0%{color}|
|doctitle:.*[Uu]nited.*|2.26|2.27|{color:green}0.4%{color}|
|unit~0.5|7.73|7.77|{color:green}0.5%{color}|
|un*d|10.80|10.87|{color:green}0.6%{color}|
|united~0.75|6.77|6.97|{color:green}2.8%{color}|
|unit~0.7|12.97|13.41|{color:green}3.4%{color}|
|united~0.6|4.10|4.26|{color:green}3.7%{color}|
|u*d|4.91|5.10|{color:green}4.0%{color}|
|spanNear([unit, state], 10, true)|20.50|21.72|{color:green}5.9%{color}|
|states|30.00|33.15|{color:green}10.5%{color}|
|+united +states|9.71|10.78|{color:green}11.1%{color}|
|united states|9.65|10.96|{color:green}13.6%{color}|
|+nebraska +states|43.93|54.38|{color:green}23.8%{color}|

64-bit MMap
||Query||QPS BulkVInt||QPS Simple64VarInt4||Pct diff||||
|"united states"|8.99|8.41|{color:red}-6.4%{color}|
|states|38.21|36.16|{color:red}-5.4%{color}|
|spanFirst(unit, 5)|118.11|112.19|{color:red}-5.0%{color}|
|doctimesecnum:[10000 TO 60000]|10.78|10.35|{color:red}-4.0%{color}|
|spanNear([unit, state], 10, true)|33.78|32.51|{color:red}-3.7%{color}|
|"united states"~3|4.68|4.54|{color:red}-3.0%{color}|
|unit*|30.00|29.26|{color:red}-2.4%{color}|
|uni*|17.48|17.06|{color:red}-2.4%{color}|
|united states|11.60|11.35|{color:red}-2.1%{color}|
|+united +states|13.95|14.08|{color:green}1.0%{color}|
|united~0.75|10.76|10.87|{color:green}1.1%{color}|
|united~0.6|7.75|7.88|{color:green}1.7%{color}|
|un*d|17.16|17.66|{color:green}2.9%{color}|
|doctitle:.*[Uu]nited.*|3.85|3.98|{color:green}3.3%{color}|
|unit~0.7|27.00|28.08|{color:green}4.0%{color}|
|unit~0.5|16.64|17.46|{color:green}4.9%{color}|
|u*d|8.68|9.31|{color:green}7.2%{color}|
|+nebraska +states|83.30|96.53|{color:green}15.9%{color}|



> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886.patch, LUCENE-2886.patch, LUCENE-2886.patch, LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993053#comment-12993053 ] 

Robert Muir commented on LUCENE-2886:
-------------------------------------

Thanks Mike... looks improved over your previous results.

Should we commit this to bulk branch? I think we should also remove the "fixed-int" version 1.0 of Simple64 we have there...

> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886.patch, LUCENE-2886.patch, LUCENE-2886.patch, LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990543#comment-12990543 ] 

Robert Muir commented on LUCENE-2886:
-------------------------------------

Thanks for those numbers Renaud... yes the cases you see in e.g. Geonames
was one example of what I was thinking: in general you might say someone
should be using "omitTFAP" to omit freqs and positions for these fields,
but they might not be able to do this, if they want to support e.g. phrase
queries like "washington hill". So if we can pack long streams of 1s with 
freqs and positions I think this is probably useful for a lot of people.

Additionally for the .doc, i see its smaller in the AFOR-3 case too. Is
your "Ent" basically a measure of doc deltas? I'm confused exactly
what it is :) Because I would think if you take e.g. Geonames, the place 
names in the dataset are not in random order but actually "batched" by
country for example, so you would have long streams of docdelta=1 for
country=Germany's postings. 

I'm not saying we could rely upon this, but i do think in general lots
of people's docs aren't in completely random order, and its probably
common to see long streams of docdelta=1 in structured data for this reason?




> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2886:
---------------------------------------

    Attachment: LUCENE-2886.patch

New patch, cuts over to bulk-reading the byte[] and then pulling a LongBuffer from that.

> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886.patch, LUCENE-2886.patch, LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990143#comment-12990143 ] 

Michael McCandless commented on LUCENE-2886:
--------------------------------------------

OK I committed the two new Simple64 codecs (to bulk branch).

> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Renaud Delbru (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990538#comment-12990538 ] 

Renaud Delbru commented on LUCENE-2886:
---------------------------------------

Just an additional comment on semi-structured data indexing. AFOR-2 and AFOR-3 (AFOR-3 refers to AFOR-2 with special code for allOnes frames), was able to beat Rice on two datasets, and S-64 on one (but it was very close to Rice on the others):

DBpedia dataset: (structured version of wikipedia)

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.246|0.043|0.141|0.065|0.180|0.816|
|AFOR-2|0.229|0.039|0.132|0.059|0.167|0.758|
|AFOR-3|0.229|0.031|0.131|0.054|0.159|0.736|
|FOR|0.315|0.061|0.170|0.117|0.216|1.049|
|PFOR|0.317|0.044|0.155|0.070|0.205|0.946|
|Rice|0.240|0.029|0.115|0.057|0.152|0.708|
|S-64|0.249|0.041|0.133|0.062|0.171|0.791|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Geonames Dataset: 

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.129|0.023|0.058|0.025|0.025|0.318|
|AFOR-2|0.123|0.023|0.057|0.024|0.024|0.307|
|AFOR-3|0.114|0.006|0.056|0.016|0.008|0.256|
|FOR|0.150|0.021|0.065|0.025|0.023|0.349|
|PFOR|0.154|0.019|0.057|0.022|0.023|0.332|
|Rice|0.133|0.019|0.063|0.029|0.021|0.327|
|S-64|0.147|0.021|0.058|0.023|0.023|0.329|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Sindice Dataset: Very heterogeneous dataset containing hundred of thousands of web dataset

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|2.578|0.395|0.942|0.665|1.014|6.537|
|AFOR-2|2.361|0.380|0.908|0.619|0.906|6.082|
|AFOR-3|2.297|0.176|0.876|0.530|0.722|5.475|
|FOR|3.506|0.506|1.121|0.916|1.440|8.611|
|PFOR|3.221|0.374|1.153|0.795|1.227|7.924|
|Rice|2.721|0.314|0.958|0.714|0.941|6.605|
|S-64|2.581|0.370|0.917|0.621|0.908|6.313|
|VByte|3.287|2.106|2.411|2.430|2.488|15.132|

Here, Ent refers to entity id (similar to doc id), Att and Val are structural node ids.

> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Renaud Delbru (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990538#comment-12990538 ] 

Renaud Delbru edited comment on LUCENE-2886 at 2/4/11 12:05 PM:
----------------------------------------------------------------

Just an additional comment on semi-structured data indexing. AFOR-2 and AFOR-3 (AFOR-3 refers to AFOR-2 with special code for allOnes frames), was able to beat Rice on two datasets, and S-64 on one (but it was very close to Rice on the others):

DBpedia dataset: (structured version of wikipedia)

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.246|0.043|0.141|0.065|0.180|0.816|
|AFOR-2|0.229|0.039|0.132|0.059|0.167|0.758|
|AFOR-3|0.229|0.031|0.131|0.054|0.159|0.736|
|FOR|0.315|0.061|0.170|0.117|0.216|1.049|
|PFOR|0.317|0.044|0.155|0.070|0.205|0.946|
|Rice|0.240|0.029|0.115|0.057|0.152|0.708|
|S-64|0.249|0.041|0.133|0.062|0.171|0.791|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Geonames Dataset: 

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.129|0.023|0.058|0.025|0.025|0.318|
|AFOR-2|0.123|0.023|0.057|0.024|0.024|0.307|
|AFOR-3|0.114|0.006|0.056|0.016|0.008|0.256|
|FOR|0.150|0.021|0.065|0.025|0.023|0.349|
|PFOR|0.154|0.019|0.057|0.022|0.023|0.332|
|Rice|0.133|0.019|0.063|0.029|0.021|0.327|
|S-64|0.147|0.021|0.058|0.023|0.023|0.329|
|VByte|0.216|0.142|0.143|0.143|0.143|0.929|

Sindice Dataset: Very heterogeneous dataset containing hundred of thousands of web dataset

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|2.578|0.395|0.942|0.665|1.014|6.537|
|AFOR-2|2.361|0.380|0.908|0.619|0.906|6.082|
|AFOR-3|2.297|0.176|0.876|0.530|0.722|5.475|
|FOR|3.506|0.506|1.121|0.916|1.440|8.611|
|PFOR|3.221|0.374|1.153|0.795|1.227|7.924|
|Rice|2.721|0.314|0.958|0.714|0.941|6.605|
|S-64|2.581|0.370|0.917|0.621|0.908|6.313|
|VByte|3.287|2.106|2.411|2.430|2.488|15.132|

Here, Ent refers to entity id (similar to doc id), Att and Val are structural node ids.

      was (Author: renaud.delbru):
    Just an additional comment on semi-structured data indexing. AFOR-2 and AFOR-3 (AFOR-3 refers to AFOR-2 with special code for allOnes frames), was able to beat Rice on two datasets, and S-64 on one (but it was very close to Rice on the others):

DBpedia dataset: (structured version of wikipedia)

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.246|0.043|0.141|0.065|0.180|0.816|
|AFOR-2|0.229|0.039|0.132|0.059|0.167|0.758|
|AFOR-3|0.229|0.031|0.131|0.054|0.159|0.736|
|FOR|0.315|0.061|0.170|0.117|0.216|1.049|
|PFOR|0.317|0.044|0.155|0.070|0.205|0.946|
|Rice|0.240|0.029|0.115|0.057|0.152|0.708|
|S-64|0.249|0.041|0.133|0.062|0.171|0.791|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Geonames Dataset: 

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.129|0.023|0.058|0.025|0.025|0.318|
|AFOR-2|0.123|0.023|0.057|0.024|0.024|0.307|
|AFOR-3|0.114|0.006|0.056|0.016|0.008|0.256|
|FOR|0.150|0.021|0.065|0.025|0.023|0.349|
|PFOR|0.154|0.019|0.057|0.022|0.023|0.332|
|Rice|0.133|0.019|0.063|0.029|0.021|0.327|
|S-64|0.147|0.021|0.058|0.023|0.023|0.329|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Sindice Dataset: Very heterogeneous dataset containing hundred of thousands of web dataset

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|2.578|0.395|0.942|0.665|1.014|6.537|
|AFOR-2|2.361|0.380|0.908|0.619|0.906|6.082|
|AFOR-3|2.297|0.176|0.876|0.530|0.722|5.475|
|FOR|3.506|0.506|1.121|0.916|1.440|8.611|
|PFOR|3.221|0.374|1.153|0.795|1.227|7.924|
|Rice|2.721|0.314|0.958|0.714|0.941|6.605|
|S-64|2.581|0.370|0.917|0.621|0.908|6.313|
|VByte|3.287|2.106|2.411|2.430|2.488|15.132|

Here, Ent refers to entity id (similar to doc id), Att and Val are structural node ids.
  
> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Renaud Delbru (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990548#comment-12990548 ] 

Renaud Delbru commented on LUCENE-2886:
---------------------------------------

{quote}
So if we can pack long streams of 1s with
freqs and positions I think this is probably useful for a lot of people.
{quote}
Yes, if the overhead is minimal, it might not be an issue in certain cases.

{quote}
Additionally for the .doc, i see its smaller in the AFOR-3 case too. Is
your "Ent" basically a measure of doc deltas? I'm confused exactly
what it is 
{quote}

Yes, Ent is jsut a delta representation of the id of the entity (which can be considered as the document id). It is just that I have changed the name of the concept, as SIREn is manipulating principally entity and not document. In my case, an entity is just a set of attribute-value pairs, similarly to a document in Lucene.

{quote}
Because I would think if you take e.g. Geonames, the place
names in the dataset are not in random order but actually "batched" by
country for example, so you would have long streams of docdelta=1 for
country=Germany's postings. 
{quote}
I checked, and Geonames dataset was alphabetically sorted by url names:
http://sws.geonames.org/1/
http://sws.geonames.org/10/
...
as well as dbpedia and sindice.

So, yes, this might have (good) consequences on the docdelta list for certain datasets such as geonames. And especially when indexing semi-structured data, as the schema of the data in one dataset is generally identical across entities/documents. therefore it is likely to see long runs of 1 for certain terms or schema terms.

> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990534#comment-12990534 ] 

Robert Muir commented on LUCENE-2886:
-------------------------------------

bq. Hi Michael, Robert,
great to hear that the code is useful, looking forward to see some benchmark.
I think the VarIntBlock approach is a good idea. Concerning the two unused "frame" codes, it will not cost too much to add them. This might be useful for the frequency inverted lists. However, I am not sure they will be used that much. 

In the case of 240 1's, i was surprised to see this selector was used over 2% of the time
for the gov collection's doc file?

But still, for the all 1's case I'm not actually thinking about unstructured text so much... 
in this case I am thinking about metadata fields and more structured data?


> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990570#comment-12990570 ] 

Robert Muir commented on LUCENE-2886:
-------------------------------------

Hi Renaud:

The BulkVInt codec is VInt implemented as a FixedIntBlock codec.
So it reads a single numBytes Vint header, then a byte[], and decodes BLOCKSIZE vints out of it.
The reason for this, is it has much different performance than "StandardCodec",
due to the fact StandardCodec has to readByte() readByte() readByte() ...

You can see the code here: http://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings/lucene/src/java/org/apache/lucene/index/codecs/bulkvint/BulkVIntCodec.java

One reason for this, is to differentiate performance improvements of actual compression
algorithms from the way that they do their underlying I/O... previously various codecs
looked much faster than Vint but a lot of the reason for this is due to the way Vint
was implemented...

And yes, you are correct nebraska is a lower freq term. the +united +states is a more 
"normal" phrase query, but +nebraska +states is a phrase query intended to do a lot 
of advance()'ing... 


> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2886:
--------------------------------

    Attachment: LUCENE-2886_simple64.patch

I pulled out the simple64 implementation here and adapted it to the bulkpostings branch.

Thanks for uploading this code Renaud, its great and the code is easy to work with. I hope to get some more of the codecs you wrote into the branch for testing.

I changed a few things that helped in benchmarking:
* the decoder uses relative gets instead of absolute
* we write #longs in the block header instead of #bytes (as its always long aligned, but smaller numbers)

But mainly, for this one I think we should change it to be a VariableIntBlock codec... right now it packs 128 integers into as few longs as possible, but this is wasteful for two reasons: it has to write a per-block byte header, and also wastes bits (e.g. in the case of a block of 128 1's).

With variableintblock, we could do this differently, e.g. read a fixed number of longs per-block (say 4 longs), and our block would then be variable between 4 and 240 integers depending upon data.


> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2886:
---------------------------------------

    Attachment: LUCENE-2886.patch

Attached patch, adding block multiplier to Simple64VarInt.

I haven't tested perf impact yet...

> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886.patch, LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990555#comment-12990555 ] 

Michael McCandless commented on LUCENE-2886:
--------------------------------------------

I test perf of BulkVInt vs Simple64VarInt (Linux, 10M wikipedia index, NIOFSDir, multi-seg, java -server -Xbatch -Xmx2g -Xms2g):

||Query||QPS bulkvint||QPS simple64varint||Pct diff||||
|+united +states|19.46|16.47|{color:red}-15.4%{color}|
|"united states"|11.92|10.09|{color:red}-15.3%{color}|
|unit~0.5|17.35|16.96|{color:red}-2.2%{color}|
|"united states"~3|5.70|5.72|{color:green}0.3%{color}|
|united~0.6|8.24|8.31|{color:green}0.8%{color}|
|doctitle:.*[Uu]nited.*|5.36|5.48|{color:green}2.1%{color}|
|united~0.75|12.45|12.75|{color:green}2.4%{color}|
|unit~0.7|27.37|28.08|{color:green}2.6%{color}|
|spanFirst(unit, 5)|156.48|162.78|{color:green}4.0%{color}|
|united states|15.35|16.07|{color:green}4.7%{color}|
|spanNear([unit, state], 10, true)|42.84|46.11|{color:green}7.6%{color}|
|states|48.13|52.03|{color:green}8.1%{color}|
|unit*|32.80|37.07|{color:green}13.0%{color}|
|doctimesecnum:[10000 TO 60000]|12.38|14.16|{color:green}14.4%{color}|
|uni*|18.66|21.53|{color:green}15.3%{color}|
|u*d|9.66|11.17|{color:green}15.7%{color}|
|un*d|19.00|22.04|{color:green}16.0%{color}|
|+nebraska +states|101.66|141.83|{color:green}39.5%{color}|


> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Renaud Delbru (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990509#comment-12990509 ] 

Renaud Delbru commented on LUCENE-2886:
---------------------------------------

Hi Michael, Robert,
great to hear that the code is useful, looking forward to see some benchmark.
I think the VarIntBlock approach is a good idea. Concerning the two unused "frame" codes, it will not cost too much to add them. This might be useful for the frequency inverted lists. However, I am not sure they will be used that much. In our experiments, we had a version of AFOR allowing frames of size 8, 16 and 32 integers with allOnes and allZeros. The gain was very minimal, in the order to 0.x% index size reduction, because these cases were occurring very rarely. But, this is still better than nothing. However, in the case of simple64, we are not talking about small frame (up to 32 integers), but frame of 120 to 240 integers. Therefore, I expect to see a drop of probability to encounter 120 or 240 consecutive ones. Maybe we can use them for more clever configurations such as
- inter-leaved sequences of 1 bit and 2 bits integers
- inter-leaved sequences of 2 bits and 3 bits integers
or something like this.
The best will be to do some tests to see which new configurations will make sense, like how many times a allOnes config is selected, or other configs, and choose which one to add.

> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993048#comment-12993048 ] 

Michael McCandless commented on LUCENE-2886:
--------------------------------------------

Linux results, NIOFSDir:


||Query||QPS bulkvint||QPS simple64x4||Pct diff||||
|"united states"|11.95|10.50|{color:red}-12.1%{color}|
|+united +states|19.45|18.94|{color:red}-2.6%{color}|
|unit~0.5|18.20|17.85|{color:red}-1.9%{color}|
|doctitle:.*[Uu]nited.*|5.38|5.40|{color:green}0.4%{color}|
|spanNear([unit, state], 10, true)|43.59|44.31|{color:green}1.7%{color}|
|united~0.6|8.48|8.72|{color:green}2.8%{color}|
|"united states"~3|5.75|5.96|{color:green}3.6%{color}|
|united~0.75|12.66|13.14|{color:green}3.8%{color}|
|unit~0.7|27.58|28.84|{color:green}4.6%{color}|
|spanFirst(unit, 5)|157.05|165.43|{color:green}5.3%{color}|
|united states|15.41|16.26|{color:green}5.6%{color}|
|states|48.10|51.95|{color:green}8.0%{color}|
|unit*|32.72|36.24|{color:green}10.8%{color}|
|uni*|18.66|20.97|{color:green}12.3%{color}|
|u*d|9.63|10.86|{color:green}12.8%{color}|
|+nebraska +states|103.95|117.66|{color:green}13.2%{color}|
|un*d|18.91|21.45|{color:green}13.4%{color}|
|doctimesecnum:[10000 TO 60000]|12.33|13.99|{color:green}13.5%{color}|


And MMapDir:

||Query||QPS bulkvint||QPS simple64x4||Pct diff||||
|"united states"|12.86|11.01|{color:red}-14.4%{color}|
|+nebraska +states|129.23|126.21|{color:red}-2.3%{color}|
|"united states"~3|6.05|6.06|{color:green}0.1%{color}|
|united~0.6|10.55|10.72|{color:green}1.6%{color}|
|unit~0.7|38.24|39.00|{color:green}2.0%{color}|
|+united +states|19.29|19.71|{color:green}2.2%{color}|
|united~0.75|15.08|15.52|{color:green}2.9%{color}|
|united states|16.47|16.95|{color:green}2.9%{color}|
|doctitle:.*[Uu]nited.*|6.23|6.46|{color:green}3.8%{color}|
|spanNear([unit, state], 10, true)|47.26|49.44|{color:green}4.6%{color}|
|unit*|36.28|38.01|{color:green}4.7%{color}|
|doctimesecnum:[10000 TO 60000]|14.02|14.90|{color:green}6.3%{color}|
|uni*|21.08|22.46|{color:green}6.6%{color}|
|spanFirst(unit, 5)|169.47|180.70|{color:green}6.6%{color}|
|unit~0.5|23.03|24.79|{color:green}7.6%{color}|
|un*d|21.40|23.64|{color:green}10.5%{color}|
|states|50.60|56.17|{color:green}11.0%{color}|
|u*d|11.16|12.59|{color:green}12.8%{color}|


> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886.patch, LUCENE-2886.patch, LUCENE-2886.patch, LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990560#comment-12990560 ] 

Michael McCandless commented on LUCENE-2886:
--------------------------------------------

Same as last perf test, except MMapDir instead:

||Query||QPS bulkvint||QPS simple64varint||Pct diff||||
|"united states"|12.81|9.95|{color:red}-22.3%{color}|
|"united states"~3|6.04|5.35|{color:red}-11.4%{color}|
|+united +states|19.18|17.25|{color:red}-10.1%{color}|
|united~0.6|10.35|10.25|{color:red}-0.9%{color}|
|united states|16.33|16.29|{color:red}-0.2%{color}|
|doctimesecnum:[10000 TO 60000]|14.18|14.24|{color:green}0.4%{color}|
|unit*|36.69|37.20|{color:green}1.4%{color}|
|united~0.75|14.71|14.92|{color:green}1.4%{color}|
|unit~0.5|22.15|22.55|{color:green}1.8%{color}|
|uni*|21.27|21.73|{color:green}2.2%{color}|
|spanFirst(unit, 5)|166.18|171.96|{color:green}3.5%{color}|
|doctitle:.*[Uu]nited.*|6.24|6.46|{color:green}3.6%{color}|
|unit~0.7|34.18|35.49|{color:green}3.8%{color}|
|spanNear([unit, state], 10, true)|47.16|49.53|{color:green}5.0%{color}|
|states|50.71|53.52|{color:green}5.5%{color}|
|un*d|21.51|22.96|{color:green}6.7%{color}|
|u*d|11.18|12.22|{color:green}9.3%{color}|
|+nebraska +states|127.16|161.43|{color:green}26.9%{color}|


> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990573#comment-12990573 ] 

Robert Muir commented on LUCENE-2886:
-------------------------------------

Here are my performance results, same setup (-server, etc)/index etc as Mike except
I'm using windows 64-bit (1.6.0_23/Vista), and I do not have an SSD, just a normal hard disk.
Using SimpleFS:
||Query||QPS bulkVint||QPS simple64Varint||Pct diff||||
|+united +states|10.95|10.32|{color:red}-5.8%{color}|
|"united states"~3|3.78|3.67|{color:red}-2.7%{color}|
|"united states"|6.81|6.71|{color:red}-1.5%{color}|
|doctimesecnum:[10000 TO 60000]|8.76|8.81|{color:green}0.5%{color}|
|spanNear([unit, state], 10, true)|22.42|22.92|{color:green}2.3%{color}|
|uni*|13.75|14.22|{color:green}3.4%{color}|
|unit*|24.89|25.86|{color:green}3.9%{color}|
|united~0.75|7.06|7.36|{color:green}4.4%{color}|
|unit~0.7|14.11|14.73|{color:green}4.4%{color}|
|united~0.6|4.34|4.53|{color:green}4.4%{color}|
|unit~0.5|8.16|8.55|{color:green}4.7%{color}|
|states|29.99|31.85|{color:green}6.2%{color}|
|spanFirst(unit, 5)|88.27|93.78|{color:green}6.2%{color}|
|united states|9.97|10.60|{color:green}6.4%{color}|
|doctitle:.*[Uu]nited.*|2.34|2.49|{color:green}6.6%{color}|
|u*d|5.54|6.05|{color:green}9.3%{color}|
|un*d|12.36|13.59|{color:green}9.9%{color}|
|+nebraska +states|47.58|62.53|{color:green}31.4%{color}|

Using MMap:
||Query||QPS bulkVint||QPS simple64Varint||Pct diff||||
|+united +states|13.88|11.95|{color:red}-13.9%{color}|
|spanFirst(unit, 5)|123.50|106.84|{color:red}-13.5%{color}|
|"united states"~3|4.64|4.17|{color:red}-10.1%{color}|
|"united states"|8.90|8.19|{color:red}-8.0%{color}|
|doctimesecnum:[10000 TO 60000]|10.72|9.98|{color:red}-7.0%{color}|
|spanNear([unit, state], 10, true)|33.70|31.63|{color:red}-6.1%{color}|
|uni*|17.26|16.30|{color:red}-5.5%{color}|
|states|37.14|35.40|{color:red}-4.7%{color}|
|unit*|29.96|28.57|{color:red}-4.7%{color}|
|united states|11.52|11.22|{color:red}-2.6%{color}|
|united~0.75|10.87|10.75|{color:red}-1.1%{color}|
|united~0.6|7.69|7.68|{color:red}-0.1%{color}|
|doctitle:.*[Uu]nited.*|3.83|3.91|{color:green}2.2%{color}|
|un*d|16.83|17.26|{color:green}2.6%{color}|
|unit~0.5|16.52|17.05|{color:green}3.2%{color}|
|u*d|8.51|9.08|{color:green}6.7%{color}|
|unit~0.7|26.83|28.93|{color:green}7.8%{color}|
|+nebraska +states|83.69|113.19|{color:green}35.2%{color}|



> Adaptive Frame Of Reference 
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch. 
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org