Posted to user@phoenix.apache.org by Afshin Moazami <Af...@guavus.com> on 2015/12/11 18:29:44 UTC

Index rows are not updated when the index key is updated using the bulk loader

Hi,

I guess it is a bug, but before posting in Jira, I want to see if anyone has an idea here.

I found out that the MapReduce CSV bulk load tool doesn't behave the same way as UPSERTs. Is it by design or a bug?

Here are the queries for creating the table and index:

CREATE TABLE mySchema.mainTable (
id varchar NOT NULL,
name varchar,
address varchar
CONSTRAINT pk PRIMARY KEY (id));


CREATE INDEX myIndex
ON mySchema.mainTable  (name, id)
INCLUDE (address);

If I execute two upserts where the second one updates the name (which is the leading key of the index), everything works fine: the row is updated in both the main table and the index table.

UPSERT INTO mySchema.mainTable (id, name, address) values ('1', 'john', 'Montreal');
UPSERT INTO mySchema.mainTable (id, name, address) values ('1', 'jack', 'Montreal');

SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from mySchema.mainTable where name = 'jack';   ==> one record
SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from mySchema.mainTable where name = 'john';   ==> zero records

But if I load the data into the main table using org.apache.phoenix.mapreduce.CsvBulkLoadTool, it behaves differently. The main table is updated, but a new row is appended to the index table:

HADOOP_CLASSPATH=/usr/lib/hbase/lib/hbase-protocol-1.1.2.jar:/etc/hbase/conf hadoop jar  /usr/lib/hbase/phoenix-4.5.2-HBase-1.1-bin/phoenix-4.5.2-HBase-1.1-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -d',' -s mySchema -t mainTable -i /tmp/input.txt

input.txt:

2,tomas,montreal
2,george,montreal

(I have tried it both with and without -it and got the same result.)


SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from mySchema.mainTable where name = 'tomas';   ==> one record
SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from mySchema.mainTable where name = 'george';   ==> one record

Re: Index rows are not updated when the index key is updated using the bulk loader

Posted by Gabriel Reid <ga...@gmail.com>.
Hi Afshin,

That looks like a bug to me, although I'm not too confident about
coming up with a good fix for it.

There isn't any handling in the bulk load tool for multiple updates to
the same row in a single input. Basically, the code assumes that any
given row appears only once in a given set of input. It might be
possible to do an initial filtering pass on the input to check for
multiple updates to the same row, but even that would be tricky, as
there is no total ordering of the input.
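Until the tool handles this itself, one workaround is to deduplicate the input file before running the bulk load, keeping only the last row seen per key so that the result mimics UPSERT's last-write-wins semantics. Here is a rough sketch in Python (the helper name and the key-column parameter are my own, not part of Phoenix); note it is only meaningful for a single input file read in order, since there is no defined ordering across multiple input files:

```python
import csv
from collections import OrderedDict

def dedupe_last_wins(in_path, out_path, key_col=0, delimiter=","):
    """Keep only the last row seen for each key, preserving the
    first-seen order of keys. Returns the number of rows dropped."""
    rows = OrderedDict()
    dropped = 0
    with open(in_path, newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            if not row:
                continue  # skip blank lines
            key = row[key_col]
            if key in rows:
                dropped += 1
            rows[key] = row  # a later row overwrites an earlier one
    with open(out_path, "w", newline="") as f:
        csv.writer(f, delimiter=delimiter).writerows(rows.values())
    return dropped
```

On the input from the original mail (two rows for id 2), this would keep only "2,george,montreal", which matches what the two successive UPSERTs produce.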

This is something that should at least be documented, and it might be
possible to at least detect this kind of situation (even if it can't
be handled "correctly"). Could you log a jira ticket for this?

- Gabriel

