Posted to dev@phoenix.apache.org by "James Taylor (JIRA)" <ji...@apache.org> on 2016/01/23 21:02:39 UTC

[jira] [Commented] (PHOENIX-2521) Support duplicate rows in CSV Bulk Loader

    [ https://issues.apache.org/jira/browse/PHOENIX-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113929#comment-15113929 ] 

James Taylor commented on PHOENIX-2521:
---------------------------------------

That's correct, [~moazami.afshin@gmail.com] - the CSV Bulk Loader does not handle the case when there are duplicate rows.

> Support duplicate rows in CSV Bulk Loader
> -----------------------------------------
>
>                 Key: PHOENIX-2521
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2521
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.5.2
>            Reporter: Afshin Moazami
>
> I found out that the MapReduce CSV bulk load tool doesn't behave the same as UPSERTs. Is this by design, or a bug?
> Here are the queries for creating the table and index:
> {code} CREATE TABLE mySchema.mainTable (
> id varchar NOT NULL,
> name varchar,
> address varchar
> CONSTRAINT pk PRIMARY KEY (id)); {code}
> {code} CREATE INDEX myIndex 
> ON mySchema.mainTable  (name, id) 
> INCLUDE (address); {code}
> If I execute two UPSERTs where the second one updates the name (which is the index key), everything works fine: the record is updated in both the main table and the index table.
> {code} UPSERT INTO mySchema.mainTable (id, name, address) values ('1', 'john', 'Montreal');{code}
> {code}UPSERT INTO mySchema.mainTable (id, name, address) values ('1', 'jack', 'Montreal');{code}
> {code}SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from mySchema.mainTable where name = 'jack'; {code}  ==> one record
> {code}SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from mySchema.mainTable where name = 'john';  {code}  ==> zero records
> But if I load the data into the main table using org.apache.phoenix.mapreduce.CsvBulkLoadTool, it behaves differently: the main table is updated, but the new record is appended to the index table:
> HADOOP_CLASSPATH=/usr/lib/hbase/lib/hbase-protocol-1.1.2.jar:/etc/hbase/conf hadoop jar  /usr/lib/hbase/phoenix-4.5.2-HBase-1.1-bin/phoenix-4.5.2-HBase-1.1-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -d',' -s mySchema -t mainTable -i /tmp/input.txt 
> input.txt:
> 2,tomas,montreal
> 2,george,montreal
> (I have tried it both with and without -it and got the same result.)
> {code}SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from mySchema.mainTable where name = 'tomas'; {code}  ==> one record
> {code}SELECT /*+ INDEX(mySchema.mainTable myIndex) */ * from mySchema.mainTable where name = 'george'; {code}  ==> one record
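Since the bulk loader does not collapse duplicate row keys the way sequential UPSERTs do, one workaround is to pre-deduplicate the input file before running CsvBulkLoadTool, keeping only the last row per primary key (matching UPSERT's later-write-wins semantics). A minimal sketch, not part of Phoenix itself; the helper name and key-column position are assumptions for illustration:

```python
import csv
import io

def dedupe_last(rows, key_index=0):
    """Keep only the last row seen for each key value, mirroring
    UPSERT semantics where a later write overwrites an earlier one.
    key_index is the column holding the primary key (id here)."""
    latest = {}
    for row in rows:
        latest[row[key_index]] = row  # later rows overwrite earlier ones
    # dict preserves insertion order (Python 3.7+), so output order is
    # the order in which each surviving key was last seen
    return list(latest.values())

# The two conflicting rows from input.txt above:
data = "2,tomas,montreal\n2,george,montreal\n"
rows = list(csv.reader(io.StringIO(data)))
print(dedupe_last(rows))  # only the 'george' row survives
```

Feeding the deduplicated output to the bulk loader would then leave a single entry per key in both the main table and the index table.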



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)