Posted to issues@phoenix.apache.org by "Bin Shi (JIRA)" <ji...@apache.org> on 2018/10/26 04:46:00 UTC
[jira] [Comment Edited] (PHOENIX-4594) Perform binary search on
guideposts during query compilation
[ https://issues.apache.org/jira/browse/PHOENIX-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664645#comment-16664645 ]
Bin Shi edited comment on PHOENIX-4594 at 10/26/18 4:45 AM:
------------------------------------------------------------
[~lhofhansl], let me continue the discussion of the prefix encoding.
I evaluated the benefit of compressing guideposts with prefix encoding. You can find details at [https://docs.google.com/document/d/1NIS65g-CKY5HEmUkQZvCbHFzUykLb8KPHX-9i65XVwk/edit?ts=5bd23e7c#.]
Below is a summary.
h1. *Summary of Evaluation*
h2. *Summary*
I mainly evaluated the benefit by using the following typical types of data:
# Case 1: Primary Key is Sequence in INT (4 Bytes).
When GUIDEPOSTS_WIDTH is 100MB, even in the ideal case, the data size actually increased by 7.14% after compression.
When GUIDEPOSTS_WIDTH is 10MB, even in the ideal case, the data size actually increased by 3.6% after compression.
# Case 2: Primary Key is Sequence in BIGINT (8 Bytes).
When GUIDEPOSTS_WIDTH is 100MB, in the ideal case, the data size was reduced by 6.25% after compression.
When GUIDEPOSTS_WIDTH is 10MB, in the ideal case, the data size reduced increased 9.4% after compression.
# Case 3: Real Data From Platform Team
With the data known so far, the lower bound of the size reduction after compression with prefix encoding is roughly in the range of 10% to 45%. I’ll continue to refine this calculation as I learn more about the real data.
# Case 4: Primary Key is Reverse URL
This is a typical use case for BigTable/HBase, although Salesforce might not have it. I don’t have real data for this case, but intuitively, this might be one of the typical cases in which prefix encoding achieves the most benefit.
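To illustrate why short sequential keys can grow under prefix encoding while long keys with shared prefixes shrink, here is a minimal sketch. The record layout (one byte for the shared-prefix length, one byte for the suffix length, then the suffix bytes), the key stride, and the sample reverse-URL keys are all assumptions made for illustration. They are not the format used in the evaluation, so the sketch is not expected to reproduce the percentages above.

```python
import struct


def prefix_encode(keys):
    """Encode sorted byte-string keys as (shared_len, suffix_len, suffix) records.

    Assumption: each length field takes one byte, so keys must be < 256 bytes.
    """
    out = bytearray()
    prev = b""
    for key in keys:
        # Count the leading bytes this key shares with the previous key.
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        suffix = key[shared:]
        out.append(shared)
        out.append(len(suffix))
        out += suffix
        prev = key
    return bytes(out)


def prefix_decode(data):
    """Inverse of prefix_encode: rebuild each key from the previous one."""
    keys = []
    prev = b""
    pos = 0
    while pos < len(data):
        shared, suffix_len = data[pos], data[pos + 1]
        pos += 2
        key = prev[:shared] + data[pos:pos + suffix_len]
        pos += suffix_len
        keys.append(key)
        prev = key
    return keys


def size_ratio(keys):
    """Encoded size divided by raw size; a value above 1.0 means the data grew."""
    raw = sum(len(k) for k in keys)
    return len(prefix_encode(keys)) / raw


# Hypothetical guidepost keys: 4-byte big-endian INTs spaced 1,000,000 apart,
# and reverse-URL style keys that share a long common prefix.
int_keys = [struct.pack(">I", i * 1_000_000) for i in range(1000)]
url_keys = [b"com.example.www/page/%06d" % i for i in range(1000)]

if __name__ == "__main__":
    print("INT keys ratio:", size_ratio(int_keys))
    print("reverse-URL keys ratio:", size_ratio(url_keys))
```

With this assumed layout, every record carries two bytes of length overhead, so a 4-byte key only breaks even if it shares at least two bytes with its predecessor, whereas the reverse-URL keys share most of their bytes.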
h2. *Takeaway*
# We should allow customers to choose among different compression algorithms or encoding schemes, and make this configurable.
Clearly, Case 1 and Case 2 are negative cases for prefix encoding. As Jacob pointed out, double-delta encoding should be used here. Even for Case 3 and Case 4, prefix encoding might not be the best tradeoff between performance and compression ratio.
# We should split guideposts into chunks and always encode/decode a chunk as a whole, while allowing random access across chunks. This way, we can cache/fetch only part of a table’s guideposts and facilitate tenant/view-specific queries.
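The chunking idea in the second takeaway could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name ChunkedGuideposts, the chunk size, and the length-prefixed placeholder codec (encode_chunk/decode_chunk) are hypothetical and are not taken from the Phoenix code base or the attached patches. Any real codec (prefix, double-delta, and so on) could take the placeholder's place.

```python
import bisect


def encode_chunk(keys):
    """Placeholder codec: 2-byte big-endian length followed by the key bytes."""
    out = bytearray()
    for key in keys:
        out += len(key).to_bytes(2, "big") + key
    return bytes(out)


def decode_chunk(data):
    """Decode a whole chunk produced by encode_chunk."""
    keys, pos = [], 0
    while pos < len(data):
        n = int.from_bytes(data[pos:pos + 2], "big")
        pos += 2
        keys.append(data[pos:pos + n])
        pos += n
    return keys


class ChunkedGuideposts:
    """Sorted guidepost keys stored in independently encoded chunks."""

    def __init__(self, sorted_keys, chunk_size=4):
        self.first_keys = []  # uncompressed index: first key of each chunk
        self.chunks = []      # each chunk is encoded/decoded as a whole
        for start in range(0, len(sorted_keys), chunk_size):
            chunk = sorted_keys[start:start + chunk_size]
            self.first_keys.append(chunk[0])
            self.chunks.append(encode_chunk(chunk))

    def chunk_index_for(self, key):
        """Binary search for the last chunk whose first key is <= key."""
        return max(bisect.bisect_right(self.first_keys, key) - 1, 0)

    def guideposts_in_range(self, start_key, stop_key):
        """Return (guideposts g with start_key <= g < stop_key, chunks decoded)."""
        result = []
        decoded = 0
        i = self.chunk_index_for(start_key)
        while i < len(self.chunks) and self.first_keys[i] < stop_key:
            keys = decode_chunk(self.chunks[i])
            decoded += 1
            lo = bisect.bisect_left(keys, start_key)
            hi = bisect.bisect_left(keys, stop_key)
            result.extend(keys[lo:hi])
            i += 1
        return result, decoded
```

Because only the chunks overlapping the scan range are decoded, a query restricted to one tenant's key range would touch a small part of the table's guideposts, and the remaining chunks would not need to be fetched or cached.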
> Perform binary search on guideposts during query compilation
> ------------------------------------------------------------
>
> Key: PHOENIX-4594
> URL: https://issues.apache.org/jira/browse/PHOENIX-4594
> Project: Phoenix
> Issue Type: Improvement
> Reporter: James Taylor
> Assignee: Bin Shi
> Priority: Major
> Attachments: PHOENIX-4594-0913.patch, PHOENIX-4594_0917.patch, PHOENIX-4594_0918.patch
>
>
> If there are many guideposts, performance will suffer during query compilation because we do a linear search of the guideposts to find the intersection with the scan ranges. Instead, in BaseResultIterators.getParallelScans() we should populate an array of guideposts and perform a binary search.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)