Posted to issues@phoenix.apache.org by "Bin Shi (JIRA)" <ji...@apache.org> on 2018/10/26 04:46:00 UTC
[jira] [Comment Edited] (PHOENIX-4594) Perform binary search on guideposts during query compilation

    [ https://issues.apache.org/jira/browse/PHOENIX-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664645#comment-16664645 ] 

Bin Shi edited comment on PHOENIX-4594 at 10/26/18 4:45 AM:
------------------------------------------------------------

[~lhofhansl], let me continue the discussion of the prefix encoding.

I evaluated the benefit of compressing guideposts in prefix encoding. You can find details at [https://docs.google.com/document/d/1NIS65g-CKY5HEmUkQZvCbHFzUykLb8KPHX-9i65XVwk/edit?ts=5bd23e7c#.] 

Below is a summary.
h1. *Summary of Evaluation*
h2. *Summary*

I evaluated the benefit mainly against the following typical types of data:
 # Case 1: Primary Key is a sequence in INT (4 bytes).
When GUIDEPOSTS_WIDTH is 100MB, even in the ideal case, the data size actually increased by 7.14% after compression.
When GUIDEPOSTS_WIDTH is 10MB, even in the ideal case, the data size actually increased by 3.6% after compression.
 # Case 2: Primary Key is a sequence in BIGINT (8 bytes).
When GUIDEPOSTS_WIDTH is 100MB, in the ideal case, the data size was reduced by 6.25% after compression.
When GUIDEPOSTS_WIDTH is 10MB, in the ideal case, the data size reduced increased 9.4% after compression.
 # Case 3: Real Data From Platform Team
With the data known so far, the lower bound of the size reduction from prefix encoding is roughly in the range of 10% to 45%. I’ll keep refining this calculation as I learn more about the real data.
 # Case 4: Primary Key is Reverse URL

This is a typical use case for BigTable/HBase, although Salesforce might not have it. I don’t have real data for this case, but intuitively it might be one of the typical cases in which prefix encoding achieves the most benefit.
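To make the contrast between the cases concrete, here is a minimal sketch of prefix encoding over sorted keys. The byte layout (one length byte per raw key, a two-byte header per encoded key) and the function names are assumptions made for illustration only; they are not Phoenix's storage format and do not reproduce the percentages above.

```python
def prefix_encode(keys):
    """Encode sorted byte keys as (shared_prefix_len, suffix) relative to the previous key."""
    encoded = []
    prev = b""
    for key in keys:
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def prefix_decode(encoded):
    """Rebuild the original keys from (shared_prefix_len, suffix) pairs."""
    keys = []
    prev = b""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


def raw_size(keys):
    # Assumed layout: one length byte followed by the key bytes.
    return sum(1 + len(key) for key in keys)


def encoded_size(encoded):
    # Assumed layout: one byte for the shared-prefix length, one for the
    # suffix length, followed by the suffix bytes.
    return sum(2 + len(suffix) for _, suffix in encoded)


# Reverse-URL style keys share long prefixes, so the encoding pays off.
url_keys = [b"com.example/a", b"com.example/b",
            b"com.example/images/1", b"org.apache/phoenix"]

# Widely spaced 4-byte integer keys share no leading bytes, so the
# per-key header is pure overhead.
int_keys = [(n << 24).to_bytes(4, "big") for n in range(1, 5)]

for name, keys in (("url", url_keys), ("int", int_keys)):
    print(name, raw_size(keys), encoded_size(prefix_encode(keys)))
```

Under these assumed layouts, the URL keys shrink while the integer keys grow, which is the same direction of effect reported for Case 4 versus Case 1.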
h2. *Takeaway*
 # We should allow customers to choose among different compression algorithms or encoding schemes, and make the choice configurable.
Obviously, Case 1 and Case 2 are negative cases for prefix encoding. As Jacob pointed out, double-delta encoding should be used there. Even for Case 3 and Case 4, prefix encoding might not be the best tradeoff between performance and compression ratio.
 

 # We should split guideposts into chunks and always encode/decode a chunk as a whole, while allowing random access across chunks. This way, we can cache/fetch only part of a table’s guideposts, which facilitates tenant/view-specific queries.
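The two takeaways can be sketched together, assuming guideposts that decode to integers as in Case 1 and Case 2. Each chunk is double-delta encoded as a whole, and an uncompressed list of each chunk's first key gives random access across chunks. The class and function names (`ChunkedGuideposts`, `delta2_encode`, `delta2_decode`) are hypothetical and not part of Phoenix.

```python
from bisect import bisect_right


def delta2_encode(values):
    """Double-delta: keep the first value, then store deltas of deltas."""
    if not values:
        return []
    out = [values[0]]
    prev, prev_delta = values[0], 0
    for v in values[1:]:
        delta = v - prev
        out.append(delta - prev_delta)
        prev, prev_delta = v, delta
    return out


def delta2_decode(encoded):
    """Invert delta2_encode."""
    if not encoded:
        return []
    values = [encoded[0]]
    delta = 0
    for dd in encoded[1:]:
        delta += dd
        values.append(values[-1] + delta)
    return values


class ChunkedGuideposts:
    """Sorted integer guideposts stored as independently encoded chunks."""

    def __init__(self, guideposts, chunk_size):
        self.first_keys = []   # uncompressed index: first key of each chunk
        self.chunks = []       # each chunk is encoded/decoded as a whole
        self.decoded_chunks = 0
        for i in range(0, len(guideposts), chunk_size):
            chunk = guideposts[i:i + chunk_size]
            self.first_keys.append(chunk[0])
            self.chunks.append(delta2_encode(chunk))

    def floor(self, key):
        """Return the largest guidepost <= key, or None if there is none."""
        c = bisect_right(self.first_keys, key) - 1
        if c < 0:
            return None
        values = delta2_decode(self.chunks[c])  # only this chunk is decoded
        self.decoded_chunks += 1
        return values[bisect_right(values, key) - 1]


index = ChunkedGuideposts(list(range(0, 1000, 10)), chunk_size=16)
print(index.floor(257), index.decoded_chunks)
```

For evenly spaced keys, every value after the first two in a chunk encodes to zero, which is why double-delta suits sequential keys. A lookup decodes a single chunk, so the remaining chunks never need to be fetched or cached.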


was (Author: bin shi):
[~lhofhansl], let me continue the discussion of the prefix encoding.

I evaluated the benefit of compressing guideposts in prefix encoding. You can find details at [https://docs.google.com/document/d/1NIS65g-CKY5HEmUkQZvCbHFzUykLb8KPHX-9i65XVwk/edit?ts=5bd23e7c#.] 

Below is a summary.
h1. *Summary of Evaluation*
h2. *Summary*

I evaluated the benefit mainly against the following typical types of data:
 # Case 1: Primary Key is a sequence in INT (4 bytes)
 # When GUIDEPOSTS_WIDTH is 100MB, even in the ideal case, the data size actually increased by 7.14% after compression.
 # When GUIDEPOSTS_WIDTH is 10MB, even in the ideal case, the data size actually increased by 3.6% after compression.


 # Case 2: Primary Key is a sequence in BIGINT (8 bytes)
 # When GUIDEPOSTS_WIDTH is 100MB, in the ideal case, the data size was reduced by 6.25% after compression.
 # When GUIDEPOSTS_WIDTH is 10MB, in the ideal case, the data size reduced increased 9.4% after compression.


 # Case 3: Real Data From Platform Team

With the data known so far, the lower bound of the size reduction from prefix encoding is roughly in the range of 10% to 45%. I’ll keep refining this calculation as I learn more about the real data.
 # Case 4: Primary Key is Reverse URL

This is a typical use case for BigTable/HBase, although Salesforce might not have it. I don’t have real data for this case, but intuitively it might be one of the typical cases in which prefix encoding achieves the most benefit.
h2. *Takeaway*
 # We should allow customers to choose among different compression algorithms or encoding schemes, and make the choice configurable.

Obviously, Case 1 and Case 2 are negative cases for prefix encoding. As Jacob pointed out, double-delta encoding should be used there. Even for Case 3 and Case 4, prefix encoding might not be the best tradeoff between performance and compression ratio.
 # We should split guideposts into chunks and always encode/decode a chunk as a whole, while allowing random access across chunks. This way, we can cache/fetch only part of a table’s guideposts, which facilitates tenant/view-specific queries.

> Perform binary search on guideposts during query compilation
> ------------------------------------------------------------
>
>                 Key: PHOENIX-4594
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4594
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: James Taylor
>            Assignee: Bin Shi
>            Priority: Major
>         Attachments: PHOENIX-4594-0913.patch, PHOENIX-4594_0917.patch, PHOENIX-4594_0918.patch
>
>
> If there are many guideposts, performance will suffer during query compilation because we do a linear search of the guideposts to find the intersection with the scan ranges. Instead, in BaseResultIterators.getParallelScans() we should populate an array of guideposts and perform a binary search. 
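The idea in the issue description can be illustrated with a small sketch that finds the guideposts inside one scan range, assuming a sorted guidepost list and a range with an inclusive start key and an exclusive stop key. The function names are hypothetical; this is not the code in BaseResultIterators.getParallelScans().

```python
from bisect import bisect_left


def guideposts_in_range_linear(guideposts, start, stop):
    """O(n): examine every guidepost."""
    return [g for g in guideposts if start <= g < stop]


def guideposts_in_range_binary(guideposts, start, stop):
    """O(log n) to locate both boundaries in the sorted list, then slice."""
    lo = bisect_left(guideposts, start)  # first guidepost >= start
    hi = bisect_left(guideposts, stop)   # first guidepost >= stop
    return guideposts[lo:hi]


guideposts = [b"b", b"d", b"f", b"h"]
print(guideposts_in_range_binary(guideposts, b"c", b"g"))
```

Both functions return the same guideposts; the binary version only avoids walking the entries that fall outside the range.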


