Posted to issues@phoenix.apache.org by "Amarnath Ramamoorthi (Jira)" <ji...@apache.org> on 2019/11/12 14:57:00 UTC

[jira] [Comment Edited] (PHOENIX-5490) Queries with leading part of the primary key doing a parallel scan causes missing data

    [ https://issues.apache.org/jira/browse/PHOENIX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938697#comment-16938697 ] 

Amarnath Ramamoorthi edited comment on PHOENIX-5490 at 11/12/19 2:56 PM:
-------------------------------------------------------------------------

[~larsh], thanks for the response.

The closest I could get to replicating it is with the following approach.

I have attached the [^foo_bigtable.zip] file (an HBase export of the table; see [https://hbase.apache.org/book.html#export]).
{code:bash}
unzip foo_bigtable.zip
hadoop fs -put FOO_BIGTABLE /tmp/FOO_BIGTABLE/
{code}
{code:sql|title=Via Phoenix}
CREATE TABLE IF NOT EXISTS "FOO"."BIG_TABLE" (
  "id_1"               BIGINT NOT NULL,
  "id_2"               BIGINT NOT NULL,
  "id_3"               BIGINT NOT NULL,
  "id_4"               BIGINT NOT NULL,
  "id_5"               BIGINT NOT NULL,
  "id_6"               BIGINT NOT NULL,
  "id"               BIGINT NOT NULL,
  "id_7"               BIGINT,
  "id_8"               BIGINT,
  CONSTRAINT "demo_pk" PRIMARY KEY ("id_1", "id_2", "id_3", "id_4", "id_5", "id_6", "id")
) SALT_BUCKETS=10, DEFAULT_COLUMN_FAMILY='BT', COMPRESSION='SNAPPY', NORMALIZATION_ENABLED=true;
{code}
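As an aside on why a leading-key query fans out over every bucket: with SALT_BUCKETS=10, Phoenix prepends a salt byte derived from a hash of the full row key, so two rows that share "id_1" can land in different buckets, and the scan has to cover [0,60268375] through [9,60268375]. A rough Python sketch of the idea (the hash function below is purely illustrative, not Phoenix's actual one):

```python
import struct

SALT_BUCKETS = 10

def salt_byte(row_key: bytes, buckets: int = SALT_BUCKETS) -> int:
    # Toy stand-in for Phoenix's salt-byte computation: hash the full
    # row key and reduce it modulo the bucket count. (Phoenix uses its
    # own hash function internally; only the idea is the same.)
    h = 0
    for b in row_key:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h % buckets

# The two rows from the test queries share the leading column "id_1"
# but differ in the rest of the key, so their salt bytes need not match
# and a query on "id_1" alone must scan every bucket.
row1 = struct.pack(">7q", 60268375, 1306309960, 4, 365224322,
                   359338115, 523433838, 1052445538)
row2 = struct.pack(">7q", 60268375, 1306309959, 4, 365224321,
                   359338114, 523433837, 1052445537)

print(salt_byte(row1), salt_byte(row2))  # each value is in 0..9
```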
Import it using:
{code:bash}
hbase org.apache.hadoop.hbase.mapreduce.Import "FOO.BIG_TABLE" /tmp/FOO_BIGTABLE/
{code}
After the import, I recommend running:
 * a major compaction (via the HBase shell)
 * UPDATE STATISTICS (via Phoenix)

{code:sql|title=Queries for testing}
SELECT "id_1", "id_2", "id_3", "id_4", "id_5", "id_6", "id" FROM "FOO"."BIG_TABLE" WHERE "id_1" = 60268375;

SELECT "id_1", "id_2", "id_3", "id_4", "id_5", "id_6", "id" FROM "FOO"."BIG_TABLE" WHERE "id_1" = 60268375 and "id_2" = 1306309960 and "id_3" = 4 and "id_4" = 365224322 and "id_5" = 359338115 and "id_6" = 523433838 and "id" = 1052445538;

SELECT "id_1", "id_2", "id_3", "id_4", "id_5", "id_6", "id" FROM "FOO"."BIG_TABLE" WHERE "id_1" = 60268375 and "id_2" = 1306309959 and "id_3" = 4 and "id_4" = 365224321 and "id_5" = 359338114 and "id_6" = 523433837 and "id" = 1052445537;
{code}
Additional information
 * SPLIT POLICY = ConstantSizeRegionSplitPolicy
 * SALT_BUCKETS = 10 (on creation)
 * ENCODING = FAST_DIFF
 * Current region count for this table = 197
 * Data Size = 234.9 G
 * Average Region Size = 956 MB / region

h4. Question

Are salting and normalization a bad combination?

I also noticed that creating a table with SALT_BUCKETS = 10 and normalization enabled sometimes reduces the region count to 6 or fewer.

Also, from [https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/master/normalizer/SimpleRegionNormalizer.html]:
{quote}Region sizes are coarse and approximate on the order of megabytes. Additionally, "empty" regions (less than 1MB, with the previous note) are not merged away. This is by design to prevent normalization from undoing the pre-splitting of a table.
{quote}
 
Given the above statement, the normalizer should not undo the pre-splitting of the table, which in our case comes from SALT_BUCKETS = 10. Am I wrong here?
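For what it's worth, my reading of the SimpleRegionNormalizer heuristic, as a very loose sketch (simplified from the javadoc, not the actual HBase code): it computes the table's average region size, plans a split for any region above twice the average, and plans a merge for adjacent pairs whose combined size is below the average. If the ten salt-bucket regions are mostly small next to a few large ones, adjacent pairs fall under that threshold and get merged, which would explain a drop from 10 regions to 6:

```python
def normalize_plan(region_sizes_mb):
    # Simplified sketch of SimpleRegionNormalizer's heuristic: split a
    # region if it exceeds 2x the average size; merge two adjacent
    # regions if their combined size is below the average. (The real
    # normalizer has more conditions, e.g. handling of empty regions;
    # this only illustrates the merge behaviour.)
    avg = sum(region_sizes_mb) / len(region_sizes_mb)
    plans = []
    i = 0
    while i < len(region_sizes_mb):
        size = region_sizes_mb[i]
        if size > 2 * avg:
            plans.append(("split", i))
            i += 1
        elif i + 1 < len(region_sizes_mb) and size + region_sizes_mb[i + 1] < avg:
            plans.append(("merge", i, i + 1))
            i += 2  # both regions are consumed by the merge
        else:
            i += 1
    return plans

# Ten regions: eight small salt buckets next to two large ones. The
# small adjacent pairs all fall below the average and get merged.
sizes = [40, 40, 40, 40, 40, 40, 40, 40, 800, 900]
print(normalize_plan(sizes))
# -> [('merge', 0, 1), ('merge', 2, 3), ('merge', 4, 5),
#     ('merge', 6, 7), ('split', 8), ('split', 9)]
```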

 



> Queries with leading part of the primary key doing a parallel scan causes missing data
> --------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-5490
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5490
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Amarnath Ramamoorthi
>            Priority: Major
>         Attachments: foo_bigtable.zip
>
>
> Table properties
>  * saltbucket_size=10
>  * compression=snappy
>  * normalization_enabled=true
>  * SPLIT POLICY = ConstantSizeRegionSplitPolicy
>  * SALT_BUCKETS = 10 (on creation)
>  * ENCODING = FAST_DIFF
> PRIMARY KEY ("id_1", "id_2", "id_3", "id_4", "id_5", "id_6", "id")
> Lookups on a partial primary key are not working as expected.
> Querying on only the leading key "id_1" = 60268375 should return 2 rows.
> {code:java|title=Parallel scan returns 1 row}
> 0: jdbc:phoenix:labs-test-namenode-lv-101,labs> explain SELECT "id_1", "id_2", "id_3", "id_4", "id_5", "id_6", "id" FROM "foo"."BIG_TABLE" WHERE "id_1" = 60268375;
> +--------------------------------------------------------------------------------------------------------------------------+
> |                                                           PLAN                                                           |
> +--------------------------------------------------------------------------------------------------------------------------+
> | CLIENT 8-CHUNK 0 ROWS 0 BYTES PARALLEL 8-WAY ROUND ROBIN RANGE SCAN OVER foo.BIG_TABLE [0,60268375] - [9,60268375]  |
> |     SERVER FILTER BY FIRST KEY ONLY                                                                                      |
> +--------------------------------------------------------------------------------------------------------------------------+
> 2 rows selected (0.047 seconds)
> 0: jdbc:phoenix:labs-test-namenode-lv-101,labs> SELECT "id_1", "id_2", "id_3", "id_4", "id_5", "id_6", "id" FROM "foo"."BIG_TABLE" WHERE "id_1" = 60268375;
> +-----------+-------------+--------+------------+------------+------------+-------------+
> |   id_1   |    id_2    | id_3  |   id_4    |   id_5   |  id_6   |     id      |
> +-----------+-------------+--------+------------+------------+------------+-------------+
> | 60268375  | 1306309959  | 4      | 365224321  | 359338114  | 523433837  | 1052445537  |
> +-----------+-------------+--------+------------+------------+------------+-------------+
> 1 row selected (3.24 seconds)
> {code}
> {code:java|title=Serial scan shows 2 rows}
> 0: jdbc:phoenix:labs-test-namenode-lv-101,labs>  explain SELECT "id_1", "id_2", "id_3", "id_4", "id_5", "id_6", "id"  FROM "foo"."BIG_TABLE" WHERE "id_1" = 60268375 limit 10;
> +-----------------------------------------------------------------------------------------------------------+
> |                                                   PLAN                                                    |
> +-----------------------------------------------------------------------------------------------------------+
> | CLIENT 10-CHUNK SERIAL 10-WAY ROUND ROBIN RANGE SCAN OVER foo.BIG_TABLE [0,60268375] - [9,60268375]  |
> |     SERVER FILTER BY FIRST KEY ONLY                                                                       |
> |     SERVER 10 ROW LIMIT                                                                                   |
> | CLIENT 10 ROW LIMIT                                                                                       |
> +-----------------------------------------------------------------------------------------------------------+
> 0: jdbc:phoenix:labs-test-namenode-lv-101,labs> SELECT "id_1", "id_2", "id_3", "id_4", "id_5", "id_6", "id" FROM "foo"."BIG_TABLE" WHERE "id_1" = 60268375 limit 10;
> +-----------+-------------+--------+------------+------------+------------+-------------+
> |   id_1   |    id_2    | id_3  |   id_4    |   id_5   |  id_6   |     id      |
> +-----------+-------------+--------+------------+------------+------------+-------------+
> | 60268375  | 1306309960  | 4      | 365224322  | 359338115  | 523433838  | 1052445538  |
> | 60268375  | 1306309959  | 4      | 365224321  | 359338114  | 523433837  | 1052445537  |
> +-----------+-------------+--------+------------+------------+------------+-------------+
> 2 rows selected (0.12 seconds)
> {code}
> Now the missing row can be retrieved on using full primary key.
> {code:java}
> 0: jdbc:phoenix:labs-test-namenode-lv-101,labs> select "id_1", "id_2", "id_3", "id_4", "id_5", "id_6", "id" FROM "foo"."BIG_TABLE" WHERE "id_1" IN (60268375) AND "id_2" IN (1306309960) AND "id_3" = 4 AND "id_4" = 365224322 AND "id_5" = 359338115 AND "id_6" = 523433838 AND "id" = 1052445538;
> +-----------+-------------+--------+------------+------------+------------+-------------+
> |   id_1   |    id_2    | id_3  |   id_4    |   id_5   |  id_6   |     id      |
> +-----------+-------------+--------+------------+------------+------------+-------------+
> | 60268375  | 1306309960  | 4      | 365224322  | 359338115  | 523433838  | 1052445538  |
> +-----------+-------------+--------+------------+------------+------------+-------------+
> 1 row selected (0.07 seconds)
> {code}
> * Phoenix version 4.13
> * HBase version 1.2.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)