You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@phoenix.apache.org by Craig Roberts <cr...@frogasia.com> on 2016/10/26 10:20:38 UTC

Duplicate Records Showing in Apache Phoenix

Hi all,

I have a strange issue with Phoenix/HBase (Phoenix 4.4 and HBase 1.1 in HDP
2.4).

This is my table DDL:

CREATE TABLE IF NOT EXISTS ANALYTICS(
  "school_code" VARCHAR NOT NULL,
  "code" VARCHAR NOT NULL,
  "iso_8601" DATE NOT NULL,
  "uuid" VARCHAR NOT NULL,
  -- Used to ensure the "events" column family exists
  "events"."___" VARCHAR,
  CONSTRAINT id PRIMARY KEY ("school_code", "code", "iso_8601", "uuid")
)

I then load events into the table using a custom Flume sink.


0: jdbc:phoenix:10.124.12.191> SELECT COUNT(*) FROM ANALYTICS;
+------------------------------------------+
|                 COUNT(1)                 |
+------------------------------------------+
| 1248515                                  |
+------------------------------------------+
1 row selected (2.415 seconds)
0: jdbc:phoenix:10.124.12.191> SELECT COUNT(DISTINCT("uuid")) FROM
ANALYTICS;
+------------------------------------------+
|          DISTINCT_COUNT("uuid")          |
+------------------------------------------+
| 773373                                   |
+------------------------------------------+
1 row selected (5.751 seconds)

Now, this is only showing up with lots of events, and I have almost 1
million events.

We've managed to pin down an errant event UUID. The primary key is meant to
ensure that repeated events (Flume's at-least-once guarantees) are
overwritten, rather than duplicated (due to UPSERT). Here's some example
queries for the errant UUID:

0: jdbc:phoenix:10.124.12.191> SELECT * FROM ANALYTICS WHERE "uuid" =
'00011783-c870-463c-9c01-4e530a57714f';
+------------------------------------------+------------------------------------------+-------------------------+------------------------------------------+
|               school_code                |
code                   |        iso_8601         |
uuid                   |
+------------------------------------------+------------------------------------------+-------------------------+------------------------------------------+
| WRA0004                                  |
site.view                                | 2016-03-21 16:00:00.000 |
00011783-c870-463c-9c01-4e530a57714f
|
| WRA0004                                  |
site.view                                | 2016-03-21 16:00:00.000 |
00011783-c870-463c-9c01-4e530a57714f
|
+------------------------------------------+------------------------------------------+-------------------------+------------------------------------------+


Which shows two events with what *should* be a duplicate primary key.

If I pin it down by including all the primary key fields, however, I get
one result:

0: jdbc:phoenix:10.124.12.191> SELECT * FROM ANALYTICS WHERE "school_code"
= 'WRA0004' AND "code" = 'site.view' AND "iso_8601" = TO_DATE('2016-03-21
16:00:00') AND "uuid" = '00011783-c870-463c-9c01-4e530a57714f';
+------------------------------------------+------------------------------------------+-------------------------+------------------------------------------+
|               school_code                |
code                   |        iso_8601         |
uuid                   |
+------------------------------------------+------------------------------------------+-------------------------+------------------------------------------+
| WRA0004                                  |
site.view                                | 2016-03-21 16:00:00.000 |
00011783-c870-463c-9c01-4e530a57714f     |
+------------------------------------------+------------------------------------------+-------------------------+------------------------------------------+

Finally, hbase shell is showing me the correct number of records:

hbase(main):001:0> count 'ANALYTICS'
# lots of output
773376 row(s) in 142.3770 seconds

Now, I'll admit, I would have expected this to be "impossible".

Does anybody have any idea how Phoenix can show me two primary keys with a
generic query, but only one for a specific query? And how I might fix this?

Thanks in advance,
*Craig Roberts*
*Senior Developer*

*FrogAsia Sdn Bhd (A YTL Company) *| Unit 9, Level 2, D6 at Sentul East |
801, Jalan Sentul, 51000 Kuala Lumpur | 01125618093 | Twitter
<http://www.twitter.com/FrogAsia> | Facebook
<http://www.facebook.com/FrogAsia> | Website <http://www.frogasia.com/>

*This message (including any attachments) is for the use of the addressee
only. It may contain private proprietary or legally privileged statements
and information. No confidentiality or privilege is waived or lost by any
mistransmission. If you are not the intended recipient, please immediately
delete it and all copies of it from your system, destroy any hard copies of
it and notify the sender. You must not, directly or indirectly, use,
disclose, distribute, print, copy or rely on any part of the message if you
are not the intended recipient. Any views expressed in this message
(including any attachments) are those of the individual sender and not
those of any member of the YTL Group, except where the message states
otherwise and the sender is authorized to state them to be the views of any
such entity.*

Re: AW: Duplicate Records Showing in Apache Phoenix

Posted by Jaanai Zhang <cl...@gmail.com>.
May some fields got incorrectly reflect after upgrading from 4.8 to 4.12,
so it could not print all selected data.


----------------------------------------
   Yun Zhang
   Best regards!


2018-06-18 16:41 GMT+08:00 Azharuddin Shaikh <azharuddin.shaikh@resilinc.com
>:

> Hi,
>
> We have upgraded the phoenix version to 4.12 from 4.8 but now we are facing
> an issue while performing table load using hbase import table utility.
>
> Data is getting imported but when we are performing any select operation on
> the loaded table it is not reflecting any records, only columns are getting
> printed.
>
> We are facing this issue only after performing phoenix version upgrade from
> 4.8  to 4.12.
>
>
>
> --
> Sent from: http://apache-phoenix-user-list.1124778.n5.nabble.com/
>

Re: AW: Duplicate Records Showing in Apache Phoenix

Posted by Azharuddin Shaikh <az...@RESILINC.COM>.
Hi,

We have upgraded the phoenix version to 4.12 from 4.8 but now we are facing
an issue while performing table load using hbase import table utility. 

Data is getting imported but when we are performing any select operation on
the loaded table it is not reflecting any records, only columns are getting
printed.

We are facing this issue only after performing phoenix version upgrade from
4.8  to 4.12.  



--
Sent from: http://apache-phoenix-user-list.1124778.n5.nabble.com/

Re: AW: Duplicate Records Showing in Apache Phoenix

Posted by Azharuddin Shaikh <az...@RESILINC.COM>.
Thanks JeongDae Kim,

You have pointed out the actual solution to this problem & PPT shared by you
is really helpful. 

We are now trying to upgrade Phoenix version to 4.12 to resolve this
duplicate records issue. Thank You:-)



--
Sent from: http://apache-phoenix-user-list.1124778.n5.nabble.com/

Re: AW: Duplicate Records Showing in Apache Phoenix

Posted by Jungdae Kim <kj...@gmail.com>.
Hello, Azharuddin.

I have experienced the same issue with Phoenix 4.9.0.

Did you have this issues for the tables that have TTL property or in which
many rows are updated or removed frequently?

In my case,  i figured out that in Phoenix, parallel scans make duplicate
rows, when all regions including the last region after the region that
include a last guidepost, have no guideposts.

The reason for duplicates is that some scans has overlapping scan ranges
from the last guide post in a table to the end key of the last region, by
phoenix bug, in the condition that i already mentioned.
A attached slide describes this issue in detail.

I think you might suffer from the same problem by this bug, because the
issue has not occurred when table has no guideposts by increasing guidepost
with or deleting system.stats.
If you want to make sure this bug causes duplicates, decrease guidepost
width to make guideposts, and check whether last region includes last
guidepost for the table or not. if not, duplicate issue caused by this bug.

To solve this issue, you have to apply a patch for Phoenix 4.7.0 or upgrade
Phoenix version to 4.12.0 or higher.
If you want to apply a patch to your phoenix, please refer to PHOENIX-4007
or a patch file i attached.

I hope my comments is helpful to you.

Regards
JeongDae Kim


김정대 드림.


On Mon, Mar 5, 2018 at 10:12 PM, Azharuddin Shaikh <
azharuddin.shaikh@resilinc.com> wrote:

> Thanks Martin for your reply.
>
> Actually after deleting system.stats also we are getting the same issue
> after few hours. It is occurring continuously.
>
> As of now, we have increased the guidepost width from 100MB to 5GB post
> which issue has not occurred but we are not confidence whether this is the
> exact solution to this problem.
>
> Let us know if there is any other specific solution to this problem and is
> there any trigger point to this issue.
>
>
>
> --
> Sent from: http://apache-phoenix-user-list.1124778.n5.nabble.com/
>

Re: AW: Duplicate Records Showing in Apache Phoenix

Posted by Azharuddin Shaikh <az...@RESILINC.COM>.
Thanks Martin for your reply.

Actually after deleting system.stats also we are getting the same issue
after few hours. It is occurring continuously. 

As of now, we have increased the guidepost width from 100MB to 5GB post
which issue has not occurred but we are not confidence whether this is the
exact solution to this problem.

Let us know if there is any other specific solution to this problem and is
there any trigger point to this issue.



--
Sent from: http://apache-phoenix-user-list.1124778.n5.nabble.com/

AW: Duplicate Records Showing in Apache Phoenix

Posted by "Hengesbach, Martin" <Ma...@fiz-Karlsruhe.de>.
Hi,

we had the same problem in our Phoenix 4.7 installation two or three times. We got it "solved" with a combination of "update statistics", "delete * from system.stats", rebuilding the indices of the table and restart of hbase. 

The problem is not reproducible at the moment and occurs very rarely. We are using Phoenix intensively with many upserts. Our Phoenix installation has > 1.600 tables and a lot of global indices. I had the impression that the error occurs when the cluster is overloaded but it's only an impression.


Regards
Martin


-----Ursprüngliche Nachricht-----
Von: Azharuddin Shaikh [mailto:azharuddin.shaikh@RESILINC.COM] 
Gesendet: Donnerstag, 1. März 2018 10:09
An: user@phoenix.apache.org
Betreff: Re: Duplicate Records Showing in Apache Phoenix

Hi,

Have you experienced this Duplicate records issue again on your hbase cluster.

Also, have you tried increasing your guidepost value for your phoenix.

Let me know if you have any specific solution to avoid this apart from dropping system.* tables to resolve this issue.

Thanks,

Azharuddin Shaikh



--
Sent from: http://apache-phoenix-user-list.1124778.n5.nabble.com/
------------------------------------------------------------------------------

FIZ Karlsruhe - Leibniz-Institut für Informationsinfrastruktur GmbH.
Sitz der Gesellschaft: Eggenstein-Leopoldshafen, Amtsgericht Mannheim HRB 101892.
Geschäftsführerin: Sabine Brünger-Weilandt.
Vorsitzender des Aufsichtsrats: MinDirig Dr. Stefan Luther.


Re: Duplicate Records Showing in Apache Phoenix

Posted by Azharuddin Shaikh <az...@RESILINC.COM>.
Hi,

Have you experienced this Duplicate records issue again on your hbase
cluster.

Also, have you tried increasing your guidepost value for your phoenix.

Let me know if you have any specific solution to avoid this apart from
dropping system.* tables to resolve this issue.

Thanks,

Azharuddin Shaikh



--
Sent from: http://apache-phoenix-user-list.1124778.n5.nabble.com/

Re: Duplicate Records Showing in Apache Phoenix

Posted by Craig Roberts <cr...@frogasia.com>.
Hi all,

I've managed to fix this by manually dropping the SYSTEM.* tables and
restarting HBase/Phoenix. I did a full reset for this, so I'm not quite
sure if I would have been able to keep my data and just drop the SYSTEM
tables, but when I get some spare time I'll see if I can get a set of
reproducible steps for this.

Thanks for the responses :)

*Craig Roberts*
*Senior Developer*

*FrogAsia Sdn Bhd (A YTL Company) *| Unit 9, Level 2, D6 at Sentul East |
801, Jalan Sentul, 51000 Kuala Lumpur | 01125618093 | Twitter
<http://www.twitter.com/FrogAsia> | Facebook
<http://www.facebook.com/FrogAsia> | Website <http://www.frogasia.com/>

*This message (including any attachments) is for the use of the addressee
only. It may contain private proprietary or legally privileged statements
and information. No confidentiality or privilege is waived or lost by any
mistransmission. If you are not the intended recipient, please immediately
delete it and all copies of it from your system, destroy any hard copies of
it and notify the sender. You must not, directly or indirectly, use,
disclose, distribute, print, copy or rely on any part of the message if you
are not the intended recipient. Any views expressed in this message
(including any attachments) are those of the individual sender and not
those of any member of the YTL Group, except where the message states
otherwise and the sender is authorized to state them to be the views of any
such entity.*

On Fri, Oct 28, 2016 at 12:15 PM, Craig Roberts <cr...@frogasia.com>
wrote:

> Hi Brian,
>
> Thanks for the response.
>
> I've managed to pin it down a bit more precisely.
>
> An SQL query by "uuid" in the Phoenix sqlline.py script (JDBC):
>
> 0: jdbc:phoenix:10.124.12.191> SELECT * FROM ANALYTICS WHERE "uuid" =
> 'c21bcfaf-2a2f-4776-9de6-7e25608baf06';
> +------------------------------------------+----------------
> --------------------------+-------------------------+-------
> -----------------------------------+
> |               school_code                |
> code                   |        iso_8601         |
> uuid                   |
> +------------------------------------------+----------------
> --------------------------+-------------------------+-------
> -----------------------------------+
> | WRA0004                                  |
> site.view                                | 2016-02-18 16:00:00.000 |
> c21bcfaf-2a2f-4776-9de6-7e25608baf06     |
> | WRA0004                                  |
> site.view                                | 2016-02-18 16:00:00.000 |
> c21bcfaf-2a2f-4776-9de6-7e25608baf06     |
> +------------------------------------------+----------------
> --------------------------+-------------------------+-------
> -----------------------------------+
>
> And a HBASE scan for the RowKey:
>
> base(main):001:0> get 'ANALYTICS', "WRA0004\x00site.view\x00\x80\
> x00\x01R\xF5\x1C\x04\x00c21bcfaf-2a2f-4776-9de6-7e25608baf06"
> COLUMN                                                CELL
>  events:_0
> timestamp=1477549465155, value=
>  events:auth.name
> timestamp=1477549465155, value=NURUL296389@Yes.My
>  events:auth.role
> timestamp=1477549465155, value=User
>  events:auth.uuid
> timestamp=1477549465155, value=23B1BEAB200055F315DE1F30
> 3EA4C00141F9704C0A51A527
>  events:date
> timestamp=1477549465155, value=2016-02-19 00:00:00
>  events:migrated
> timestamp=1477549465155, value=\x01
>  events:site_uuid
> timestamp=1477549465155, value=2BAC0082200254FE67D6CF0C
> 5E675D052A3F53BC00260F67
>  events:type
> timestamp=1477549465155, value=site
>  events:user_uuid
> timestamp=1477549465155, value=23B1BEAB200055F315DE1F30
> 3EA4C00141F9704C0A51A527
>
> We've been checking, and re-checking - but HBase is adamant it has 745,281
> rows, while Phoenix is adamant it has 1,527,352...
>
> Is there any situation where Phoenix and HBase row counts will differ?
> Some of our records will be similar, but should have different UUIDs. Does
> Phoenix/HBase de-duplication occur at any point that might cause this?
>
> Thanks for any help you guys can offer,
>
> *Craig Roberts*
> *Senior Developer*
>
> *FrogAsia Sdn Bhd (A YTL Company) *| Unit 9, Level 2, D6 at Sentul East |
> 801, Jalan Sentul, 51000 Kuala Lumpur | 01125618093 | Twitter
> <http://www.twitter.com/FrogAsia> | Facebook
> <http://www.facebook.com/FrogAsia> | Website <http://www.frogasia.com/>
>
> *This message (including any attachments) is for the use of the addressee
> only. It may contain private proprietary or legally privileged statements
> and information. No confidentiality or privilege is waived or lost by any
> mistransmission. If you are not the intended recipient, please immediately
> delete it and all copies of it from your system, destroy any hard copies of
> it and notify the sender. You must not, directly or indirectly, use,
> disclose, distribute, print, copy or rely on any part of the message if you
> are not the intended recipient. Any views expressed in this message
> (including any attachments) are those of the individual sender and not
> those of any member of the YTL Group, except where the message states
> otherwise and the sender is authorized to state them to be the views of any
> such entity.*
>
> On Wed, Oct 26, 2016 at 6:58 PM, Brian Jeltema <bd...@gmail.com>
> wrote:
>
>> I suspect that both rows exist and that they are different,  but you
>> can’t see the difference due to
>> some non-printing character. Did you try doing the query using JDBC or a
>> direct HBase scan?
>>
>> Brian
>>
>> On Oct 26, 2016, at 6:20 AM, Craig Roberts <cr...@frogasia.com>
>> wrote:
>>
>> Hi all,
>>
>> I have a strange issue with Phoenix/HBase (Phoenix 4.4 and HBase 1.1 in
>> HDP 2.4).
>>
>> This is my table DDL:
>>
>> CREATE TABLE IF NOT EXISTS ANALYTICS(
>>   "school_code" VARCHAR NOT NULL,
>>   "code" VARCHAR NOT NULL,
>>   "iso_8601" DATE NOT NULL,
>>   "uuid" VARCHAR NOT NULL,
>>   -- Used to ensure the "events" column family exists
>>   "events"."___" VARCHAR,
>>   CONSTRAINT id PRIMARY KEY ("school_code", "code", "iso_8601", "uuid")
>> )
>>
>> I then load events into the table using a custom Flume sink.
>>
>>
>> 0: jdbc:phoenix:10.124.12.191> SELECT COUNT(*) FROM ANALYTICS;
>> +------------------------------------------+
>> |                 COUNT(1)                 |
>> +------------------------------------------+
>> | 1248515                                  |
>> +------------------------------------------+
>> 1 row selected (2.415 seconds)
>> 0: jdbc:phoenix:10.124.12.191> SELECT COUNT(DISTINCT("uuid")) FROM
>> ANALYTICS;
>> +------------------------------------------+
>> |          DISTINCT_COUNT("uuid")          |
>> +------------------------------------------+
>> | 773373                                   |
>> +------------------------------------------+
>> 1 row selected (5.751 seconds)
>>
>> Now, this is only showing up with lots of events, and I have almost 1
>> million events.
>>
>> We've managed to pin down an errant event UUID. The primary key is meant
>> to ensure that repeated events (Flume's at-least-once guarantees) are
>> overwritten, rather than duplicated (due to UPSERT). Here's some example
>> queries for the errant UUID:
>>
>> 0: jdbc:phoenix:10.124.12.191> SELECT * FROM ANALYTICS WHERE "uuid" =
>> '00011783-c870-463c-9c01-4e530a57714f';
>> +------------------------------------------+----------------
>> --------------------------+-------------------------+-------
>> -----------------------------------+
>> |               school_code                |
>> code                   |        iso_8601         |
>> uuid                   |
>> +------------------------------------------+----------------
>> --------------------------+-------------------------+-------
>> -----------------------------------+
>> | WRA0004                                  |
>> site.view                                | 2016-03-21 16:00:00.000 |
>> 00011783-c870-463c-9c01-4e530a57714f     |
>>
>> | WRA0004                                  |
>> site.view                                | 2016-03-21 16:00:00.000 |
>> 00011783-c870-463c-9c01-4e530a57714f     |
>>
>> +------------------------------------------+----------------
>> --------------------------+-------------------------+-------
>> -----------------------------------+
>>
>>
>> Which shows two events with what *should* be a duplicate primary key.
>>
>> If I pin it down by including all the primary key fields, however, I get
>> one result:
>>
>> 0: jdbc:phoenix:10.124.12.191> SELECT * FROM ANALYTICS WHERE
>> "school_code" = 'WRA0004' AND "code" = 'site.view' AND "iso_8601" =
>> TO_DATE('2016-03-21 16:00:00') AND "uuid" = '00011783-c870-463c-9c01-4e530
>> a57714f';
>> +------------------------------------------+----------------
>> --------------------------+-------------------------+-------
>> -----------------------------------+
>> |               school_code                |
>> code                   |        iso_8601         |
>> uuid                   |
>> +------------------------------------------+----------------
>> --------------------------+-------------------------+-------
>> -----------------------------------+
>> | WRA0004                                  |
>> site.view                                | 2016-03-21 16:00:00.000 |
>> 00011783-c870-463c-9c01-4e530a57714f     |
>> +------------------------------------------+----------------
>> --------------------------+-------------------------+-------
>> -----------------------------------+
>>
>> Finally, hbase shell is showing me the correct number of records:
>>
>> hbase(main):001:0> count 'ANALYTICS'
>> # lots of output
>> 773376 row(s) in 142.3770 seconds
>>
>> Now, I'll admit, I would have expected this to be "impossible".
>>
>> Does anybody have any idea how Phoenix can show me two primary keys with
>> a generic query, but only one for a specific query? And how I might fix
>> this?
>>
>> Thanks in advance,
>> *Craig Roberts*
>> *Senior Developer*
>>
>> *FrogAsia Sdn Bhd (A YTL Company) *| Unit 9, Level 2, D6 at Sentul East
>> | 801, Jalan Sentul, 51000 Kuala Lumpur | 01125618093 | Twitter
>> <http://www.twitter.com/FrogAsia> | Facebook
>> <http://www.facebook.com/FrogAsia> | Website <http://www.frogasia.com/>
>>
>> *This message (including any attachments) is for the use of the addressee
>> only. It may contain private proprietary or legally privileged statements
>> and information. No confidentiality or privilege is waived or lost by any
>> mistransmission. If you are not the intended recipient, please immediately
>> delete it and all copies of it from your system, destroy any hard copies of
>> it and notify the sender. You must not, directly or indirectly, use,
>> disclose, distribute, print, copy or rely on any part of the message if you
>> are not the intended recipient. Any views expressed in this message
>> (including any attachments) are those of the individual sender and not
>> those of any member of the YTL Group, except where the message states
>> otherwise and the sender is authorized to state them to be the views of any
>> such entity.*
>>
>>
>>
>

Re: Duplicate Records Showing in Apache Phoenix

Posted by Craig Roberts <cr...@frogasia.com>.
Hi Brian,

Thanks for the response.

I've managed to pin it down a bit more precisely.

An SQL query by "uuid" in the Phoenix sqlline.py script (JDBC):

0: jdbc:phoenix:10.124.12.191> SELECT * FROM ANALYTICS WHERE "uuid" =
'c21bcfaf-2a2f-4776-9de6-7e25608baf06';
+------------------------------------------+----------------
--------------------------+-------------------------+-------
-----------------------------------+
|               school_code                |
code                   |        iso_8601         |
uuid                   |
+------------------------------------------+----------------
--------------------------+-------------------------+-------
-----------------------------------+
| WRA0004                                  | site.view
| 2016-02-18 16:00:00.000 | c21bcfaf-2a2f-4776-9de6-7e25608baf06     |
| WRA0004                                  | site.view
| 2016-02-18 16:00:00.000 | c21bcfaf-2a2f-4776-9de6-7e25608baf06     |
+------------------------------------------+----------------
--------------------------+-------------------------+-------
-----------------------------------+

And a HBASE scan for the RowKey:

base(main):001:0> get 'ANALYTICS', "WRA0004\x00site.view\x00\x80\
x00\x01R\xF5\x1C\x04\x00c21bcfaf-2a2f-4776-9de6-7e25608baf06"
COLUMN                                                CELL
 events:_0
timestamp=1477549465155, value=
 events:auth.name
timestamp=1477549465155, value=NURUL296389@Yes.My
 events:auth.role
timestamp=1477549465155, value=User
 events:auth.uuid
timestamp=1477549465155, value=23B1BEAB200055F315DE1F303EA4C0
0141F9704C0A51A527
 events:date
timestamp=1477549465155, value=2016-02-19 00:00:00
 events:migrated
timestamp=1477549465155, value=\x01
 events:site_uuid
timestamp=1477549465155, value=2BAC0082200254FE67D6CF0C5E675D
052A3F53BC00260F67
 events:type
timestamp=1477549465155, value=site
 events:user_uuid
timestamp=1477549465155, value=23B1BEAB200055F315DE1F303EA4C0
0141F9704C0A51A527

We've been checking, and re-checking - but HBase is adamant it has 745,281
rows, while Phoenix is adamant it has 1,527,352...

Is there any situation where Phoenix and HBase row counts will differ? Some
of our records will be similar, but should have different UUIDs. Does
Phoenix/HBase de-duplication occur at any point that might cause this?

Thanks for any help you guys can offer,

*Craig Roberts*
*Senior Developer*

*FrogAsia Sdn Bhd (A YTL Company) *| Unit 9, Level 2, D6 at Sentul East |
801, Jalan Sentul, 51000 Kuala Lumpur | 01125618093 | Twitter
<http://www.twitter.com/FrogAsia> | Facebook
<http://www.facebook.com/FrogAsia> | Website <http://www.frogasia.com/>

*This message (including any attachments) is for the use of the addressee
only. It may contain private proprietary or legally privileged statements
and information. No confidentiality or privilege is waived or lost by any
mistransmission. If you are not the intended recipient, please immediately
delete it and all copies of it from your system, destroy any hard copies of
it and notify the sender. You must not, directly or indirectly, use,
disclose, distribute, print, copy or rely on any part of the message if you
are not the intended recipient. Any views expressed in this message
(including any attachments) are those of the individual sender and not
those of any member of the YTL Group, except where the message states
otherwise and the sender is authorized to state them to be the views of any
such entity.*

On Wed, Oct 26, 2016 at 6:58 PM, Brian Jeltema <bd...@gmail.com> wrote:

> I suspect that both rows exist and that they are different,  but you can’t
> see the difference due to
> some non-printing character. Did you try doing the query using JDBC or a
> direct HBase scan?
>
> Brian
>
> On Oct 26, 2016, at 6:20 AM, Craig Roberts <cr...@frogasia.com>
> wrote:
>
> Hi all,
>
> I have a strange issue with Phoenix/HBase (Phoenix 4.4 and HBase 1.1 in
> HDP 2.4).
>
> This is my table DDL:
>
> CREATE TABLE IF NOT EXISTS ANALYTICS(
>   "school_code" VARCHAR NOT NULL,
>   "code" VARCHAR NOT NULL,
>   "iso_8601" DATE NOT NULL,
>   "uuid" VARCHAR NOT NULL,
>   -- Used to ensure the "events" column family exists
>   "events"."___" VARCHAR,
>   CONSTRAINT id PRIMARY KEY ("school_code", "code", "iso_8601", "uuid")
> )
>
> I then load events into the table using a custom Flume sink.
>
>
> 0: jdbc:phoenix:10.124.12.191> SELECT COUNT(*) FROM ANALYTICS;
> +------------------------------------------+
> |                 COUNT(1)                 |
> +------------------------------------------+
> | 1248515                                  |
> +------------------------------------------+
> 1 row selected (2.415 seconds)
> 0: jdbc:phoenix:10.124.12.191> SELECT COUNT(DISTINCT("uuid")) FROM
> ANALYTICS;
> +------------------------------------------+
> |          DISTINCT_COUNT("uuid")          |
> +------------------------------------------+
> | 773373                                   |
> +------------------------------------------+
> 1 row selected (5.751 seconds)
>
> Now, this is only showing up with lots of events, and I have almost 1
> million events.
>
> We've managed to pin down an errant event UUID. The primary key is meant
> to ensure that repeated events (Flume's at-least-once guarantees) are
> overwritten, rather than duplicated (due to UPSERT). Here's some example
> queries for the errant UUID:
>
> 0: jdbc:phoenix:10.124.12.191> SELECT * FROM ANALYTICS WHERE "uuid" =
> '00011783-c870-463c-9c01-4e530a57714f';
> +------------------------------------------+----------------
> --------------------------+-------------------------+-------
> -----------------------------------+
> |               school_code                |
> code                   |        iso_8601         |
> uuid                   |
> +------------------------------------------+----------------
> --------------------------+-------------------------+-------
> -----------------------------------+
> | WRA0004                                  |
> site.view                                | 2016-03-21 16:00:00.000 |
> 00011783-c870-463c-9c01-4e530a57714f     |
>
> | WRA0004                                  |
> site.view                                | 2016-03-21 16:00:00.000 |
> 00011783-c870-463c-9c01-4e530a57714f     |
>
> +------------------------------------------+----------------
> --------------------------+-------------------------+-------
> -----------------------------------+
>
>
> Which shows two events with what *should* be a duplicate primary key.
>
> If I pin it down by including all the primary key fields, however, I get
> one result:
>
> 0: jdbc:phoenix:10.124.12.191> SELECT * FROM ANALYTICS WHERE "school_code"
> = 'WRA0004' AND "code" = 'site.view' AND "iso_8601" = TO_DATE('2016-03-21
> 16:00:00') AND "uuid" = '00011783-c870-463c-9c01-4e530a57714f';
> +------------------------------------------+----------------
> --------------------------+-------------------------+-------
> -----------------------------------+
> |               school_code                |
> code                   |        iso_8601         |
> uuid                   |
> +------------------------------------------+----------------
> --------------------------+-------------------------+-------
> -----------------------------------+
> | WRA0004                                  |
> site.view                                | 2016-03-21 16:00:00.000 |
> 00011783-c870-463c-9c01-4e530a57714f     |
> +------------------------------------------+----------------
> --------------------------+-------------------------+-------
> -----------------------------------+
>
> Finally, hbase shell is showing me the correct number of records:
>
> hbase(main):001:0> count 'ANALYTICS'
> # lots of output
> 773376 row(s) in 142.3770 seconds
>
> Now, I'll admit, I would have expected this to be "impossible".
>
> Does anybody have any idea how Phoenix can show me two primary keys with a
> generic query, but only one for a specific query? And how I might fix this?
>
> Thanks in advance,
> *Craig Roberts*
> *Senior Developer*
>
> *FrogAsia Sdn Bhd (A YTL Company) *| Unit 9, Level 2, D6 at Sentul East |
> 801, Jalan Sentul, 51000 Kuala Lumpur | 01125618093 | Twitter
> <http://www.twitter.com/FrogAsia> | Facebook
> <http://www.facebook.com/FrogAsia> | Website <http://www.frogasia.com/>
>
> *This message (including any attachments) is for the use of the addressee
> only. It may contain private proprietary or legally privileged statements
> and information. No confidentiality or privilege is waived or lost by any
> mistransmission. If you are not the intended recipient, please immediately
> delete it and all copies of it from your system, destroy any hard copies of
> it and notify the sender. You must not, directly or indirectly, use,
> disclose, distribute, print, copy or rely on any part of the message if you
> are not the intended recipient. Any views expressed in this message
> (including any attachments) are those of the individual sender and not
> those of any member of the YTL Group, except where the message states
> otherwise and the sender is authorized to state them to be the views of any
> such entity.*
>
>
>

Re: Duplicate Records Showing in Apache Phoenix

Posted by Brian Jeltema <bd...@gmail.com>.
I suspect that both rows exist and that they are different,  but you can’t see the difference due to
some non-printing character. Did you try doing the query using JDBC or a direct HBase scan?

Brian

> On Oct 26, 2016, at 6:20 AM, Craig Roberts <cr...@frogasia.com> wrote:
> 
> Hi all,
> 
> I have a strange issue with Phoenix/HBase (Phoenix 4.4 and HBase 1.1 in HDP 2.4).
> 
> This is my table DDL:
> 
> CREATE TABLE IF NOT EXISTS ANALYTICS(
>   "school_code" VARCHAR NOT NULL,
>   "code" VARCHAR NOT NULL,
>   "iso_8601" DATE NOT NULL,
>   "uuid" VARCHAR NOT NULL,
>   -- Used to ensure the "events" column family exists
>   "events"."___" VARCHAR,
>   CONSTRAINT id PRIMARY KEY ("school_code", "code", "iso_8601", "uuid")
> )
> 
> I then load events into the table using a custom Flume sink.
> 
> 
> 0: jdbc:phoenix:10.124.12.191> SELECT COUNT(*) FROM ANALYTICS;
> +------------------------------------------+
> |                 COUNT(1)                 |
> +------------------------------------------+
> | 1248515                                  |
> +------------------------------------------+
> 1 row selected (2.415 seconds)
> 0: jdbc:phoenix:10.124.12.191> SELECT COUNT(DISTINCT("uuid")) FROM ANALYTICS;
> +------------------------------------------+
> |          DISTINCT_COUNT("uuid")          |
> +------------------------------------------+
> | 773373                                   |
> +------------------------------------------+
> 1 row selected (5.751 seconds)
> 
> Now, this is only showing up with lots of events, and I have almost 1 million events.
> 
> We've managed to pin down an errant event UUID. The primary key is meant to ensure that repeated events (Flume's at-least-once guarantees) are overwritten, rather than duplicated (due to UPSERT). Here's some example queries for the errant UUID:
> 
> 0: jdbc:phoenix:10.124.12.191> SELECT * FROM ANALYTICS WHERE "uuid" = '00011783-c870-463c-9c01-4e530a57714f';
> +------------------------------------------+------------------------------------------+-------------------------+------------------------------------------+
> |               school_code                |                   code                   |        iso_8601         |                   uuid                   |                   
> +------------------------------------------+------------------------------------------+-------------------------+------------------------------------------+
> | WRA0004                                  | site.view                                | 2016-03-21 16:00:00.000 | 00011783-c870-463c-9c01-4e530a57714f     |                                          
> | WRA0004                                  | site.view                                | 2016-03-21 16:00:00.000 | 00011783-c870-463c-9c01-4e530a57714f     |                                          
> +------------------------------------------+------------------------------------------+-------------------------+------------------------------------------+
> 
> 
> Which shows two events with what should be a duplicate primary key.
> 
> If I pin it down by including all the primary key fields, however, I get one result:
> 
> 0: jdbc:phoenix:10.124.12.191> SELECT * FROM ANALYTICS WHERE "school_code" = 'WRA0004' AND "code" = 'site.view' AND "iso_8601" = TO_DATE('2016-03-21 16:00:00') AND "uuid" = '00011783-c870-463c-9c01-4e530a57714f';
> +------------------------------------------+------------------------------------------+-------------------------+------------------------------------------+
> |               school_code                |                   code                   |        iso_8601         |                   uuid                   |
> +------------------------------------------+------------------------------------------+-------------------------+------------------------------------------+
> | WRA0004                                  | site.view                                | 2016-03-21 16:00:00.000 | 00011783-c870-463c-9c01-4e530a57714f     |
> +------------------------------------------+------------------------------------------+-------------------------+------------------------------------------+
> 
> Finally, hbase shell is showing me the correct number of records:
> 
> hbase(main):001:0> count 'ANALYTICS'
> # lots of output
> 773376 row(s) in 142.3770 seconds
> 
> Now, I'll admit, I would have expected this to be "impossible".
> 
> Does anybody have any idea how Phoenix can show me two primary keys with a generic query, but only one for a specific query? And how I might fix this?
> 
> Thanks in advance,
> Craig Roberts
> Senior Developer
> 
> FrogAsia Sdn Bhd (A YTL Company) | Unit 9, Level 2, D6 at Sentul East | 801, Jalan Sentul, 51000 Kuala Lumpur | 01125618093 | Twitter <http://www.twitter.com/FrogAsia> | Facebook <http://www.facebook.com/FrogAsia> | Website <http://www.frogasia.com/>
> This message (including any attachments) is for the use of the addressee only. It may contain private proprietary or legally privileged statements and information. No confidentiality or privilege is waived or lost by any mistransmission. If you are not the intended recipient, please immediately delete it and all copies of it from your system, destroy any hard copies of it and notify the sender. You must not, directly or indirectly, use, disclose, distribute, print, copy or rely on any part of the message if you are not the intended recipient. Any views expressed in this message (including any attachments) are those of the individual sender and not those of any member of the YTL Group, except where the message states otherwise and the sender is authorized to state them to be the views of any such entity.
>