Posted to dev@carbondata.apache.org by yixu2001 <yi...@163.com> on 2017/10/18 07:14:21 UTC

Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

dev 


In carbondata version 1.2.0, I executed an "update" statement with a sub-query and it failed.
None of the rows in the two tables are duplicated, and the same statement succeeds in carbondata version 1.1.1.

The test log is as follows:
scala> cc.sql("select count(*), count(distinct id) from qqdata2.c_indextest1").show(100,false);
+--------+------------------+
|count(1)|count(DISTINCT id)|
+--------+------------------+
|300000  |300000            |
+--------+------------------+

scala> cc.sql("select count(*), count(distinct id) from qqdata2.c_indextest2").show(100,false);
+--------+------------------+
|count(1)|count(DISTINCT id)|
+--------+------------------+
|71223220|71223220          |
+--------+------------------+

scala> cc.sql("update qqdata2.c_indextest2 a set(a.CUST_ORDER_ID,a.ORDER_ITEM_IDATTR_ID,a.ATTR_VALUE_IDATTR_VALUE,a.CREATE_DATE,a.UPDATE_DATE,a.STATUS_CD,a.STATUS_DATE,a.AREA_ID,a.REGION_CD,a.UPDATE_STAFF,a.CREATE_STAFF,a.SHARDING_ID,a.ORDER_ATTR_ID) = (select b.CUST_ORDER_ID,b.ORDER_ITEM_IDATTR_ID,b.ATTR_VALUE_IDATTR_VALUE,b.CREATE_DATE,b.UPDATE_DATE,b.STATUS_CD,b.STATUS_DATE,b.AREA_ID,b.REGION_CD,b.UPDATE_STAFF,b.CREATE_STAFF,b.SHARDING_ID,b.ORDER_ATTR_ID from qqdata2.c_indextest1 b where a.id = b.id)").show(100,false);
17/10/18 11:32:46 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
17/10/18 11:33:20 AUDIT deleteExecution$: [hdp84.ffcs.cn][bigdata][Thread-1]Delete data operation is failed for qqdata2.c_indextest2
17/10/18 11:33:20 ERROR deleteExecution$: main Delete data operation is failed due to failure in creating delete delta file for segment : null block : null
17/10/18 11:33:20 ERROR ProjectForUpdateCommand$: main Exception in update operationjava.lang.Exception: Multiple input rows matched for same row.
java.lang.RuntimeException: Update operation failed. Multiple input rows matched for same row.
  at scala.sys.package$.error(package.scala:27)
  at org.apache.spark.sql.execution.command.ProjectForUpdateCommand.processData(IUDCommands.scala:239)
  at org.apache.spark.sql.execution.command.ProjectForUpdateCommand.run(IUDCommands.scala:141)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeTake(commands.scala:71)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2378)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2780)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2377)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2384)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2120)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2119)
  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2810)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2119)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2334)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:640)
  ... 50 elided
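
For reference, a direct join-count check of the same shape (a minimal sketch, assuming id is the intended match key) shows whether any target row is matched by more than one sub-query row; with both id columns fully distinct, it should return no rows:

scala> cc.sql("select b.id, count(*) from qqdata2.c_indextest2 a, qqdata2.c_indextest1 b where a.id = b.id group by b.id having count(*) > 1").show(100,false);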
 


yixu2001

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by yixu2001 <yi...@163.com>.
dev 
Will the patch be released?


yixu2001
 
From: sounak
Date: 2017-11-07 20:56
To: dev
Subject: Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,
Hi,
 
This fix is still under development. Hopefully we will be able to give a
proper solution by this weekend.
 
 
 
On Tue, Nov 7, 2017 at 1:24 PM, yixu2001 <yi...@163.com> wrote:
 
> dev
> When is a patch expected to be released, so that we can test it?
>
>
> yixu2001
>
> From: yixu2001
> Date: 2017-11-07 15:47
> To: dev
> Subject: Re: Re: Update statement failed with "Multiple input rows matched
> for same row" in version 1.2.0,
> dev
>  Hello, if there is no solution yet, can the next version, carbondata 1.2.1, be
> released?
>
>
> yixu2001
>
> From: yixu2001
> Date: 2017-10-27 16:55
> To: dev
> CC: sounak; chenliang6136
> Subject: Re: Re: Update statement failed with "Multiple input rows matched
> for same row" in version 1.2.0,
> dev
>
> Thank you.
>  We hope it will be resolved in a new release version, carbondata 1.2.1.
> In addition, the mailing-list problem "Is query slowing down due to the
> fragmentations produced by the many update operations?" is also a very
> important problem in our selection process. We would be grateful if it could
> also be resolved in carbondata 1.2.1.
>
>
> yixu2001
>
> From: yixu2001
> Date: 2017-10-21 11:12
> To: dev
> Subject: Re: Re: Update statement failed with "Multiple input rows matched
> for same row" in version 1.2.0,
> dev
>
> test data
> Link: http://pan.baidu.com/s/1boVAqeF  Password: qtdh
> CSV
>
>
> yixu2001
> From: Raghunandan S
> Date: 2017-10-20 16:56
> To: dev
> Subject: Re: Re: Update statement failed with "Multiple input rows matched
> for same row" in version 1.2.0,
> Should be fine. Do we need to create any login to access it?
> On Fri, 20 Oct 2017 at 1:31 PM, yixu2001 <yi...@163.com> wrote:
> > dev
> >        Can Baidu SkyDrive be used to share it?
> >
> >
> > yixu2001
> >
> > From: sounak
> > Date: 2017-10-20 16:14
> > To: dev
> > Subject: Re: Re: Update statement failed with "Multiple input rows
> matched
> > for same row" in version 1.2.0,
> > As the tupleIds are also unique, the only option left is to get hold of
> > your data and reproduce it in-house. Is it possible for us to generate
> > your data in-house, or to get it?
> >
> > On Fri, Oct 20, 2017 at 12:14 PM, yixu2001 <yi...@163.com> wrote:
> >
> > > dev
> > >
> > > scala> cc.sql(" select  c.tupleId,count(*)  from (select a.remark,a.id,
> > > getTupleId() as tupleId from c_indextest1 a,c_indextest2 b where a.id=b.id)
> > > c group by c.tupleId having count(*)>1").show;
> > > +-------+--------+
> > > |tupleId|count(1)|
> > > +-------+--------+
> > > +-------+--------+
> > >
> > > So no multiple input rows are matched.
> > >
> > >
> > > yixu2001
> > >
> > > From: sounak
> > > Date: 2017-10-20 10:57
> > > To: dev
> > > Subject: Re: Re: Update statement failed with "Multiple input rows
> > matched
> > > for same row" in version 1.2.0,
> > > A slight correction for a typo in the query syntax:
> > >
> > > " select a.remark,a.id, getTupleId() as *tupleId* from c_indextest1 a,
> > > c_indextest2 b where a.id=b.id;"
> > >
> > > On Fri, Oct 20, 2017 at 7:54 AM, sounak <so...@gmail.com> wrote:
> > >
> > > > Internally, we call a UDF to generate the TupleId and based on that
> > > > tupleId we decide if the row is duplicated or not. You can run a
> > slightly
> > > > tweaked query
> > > >
> > > > " select a.remark,a.id, getTupleId() as TupleId from c_indextest1 a,
> > > > c_indextest2 b where a.id=b.id; "
> > > >
> > > > If "Multiple input rows" is prompted, most probably getTupleId() has
> > > > produced duplicate entries.
> > > >
> > > > Meanwhile, your given query runs fine with the data I have generated
> > > > in-house, so this is data dependent. It would be good if we could get
> > > > *origin_data* or the script which built the data for origin_data.
> > > >
> > > > On Fri, Oct 20, 2017 at 7:28 AM, yixu2001 <yi...@163.com> wrote:
> > > >
> > > >> dev
> > > >>    one row record
> > > >>
> > > >>
> > > >> yixu2001
> > > >>
> > > >> From: Liang Chen
> > > >> Date: 2017-10-19 22:26
> > > >> To: dev
> > > >> Subject: Re: Re: Update statement failed with "Multiple input rows
> > > >> matched for same row" in version 1.2.0,
> > > >> Hi
> > > >>
> > > >> Does executing the query below return one row or multiple rows?
> > > >> ---------------------------------
> > > >> select a.remark from  c_indextest1 a where a.id=b.id
> > > >>
> > > >> Regards
> > > >> Liang
> > > >>
> > > >>
> > > >> yixu2001 wrote
> > > >> > dev
> > > >> >  You can follow the steps below to reproduce the problem.
> > > >> > Table c_indextest2 has 17 million (1700w) records and table
> > > >> > c_indextest1 has about 300,000 (30w) records.
> > > >> >
> > > >> > step 1:
> > > >> > cc.sql("CREATE TABLE IF NOT EXISTS c_indextest2 (id string, remark
> > > >> > string) STORED BY 'carbondata'").show;
> > > >> >
> > > >> > step 2: origin_data is an existing table with 17 million (1700w)
> > > >> > records, and the fields in the table do not matter.
> > > >> > cc.sql("insert into c_indextest2 select row_number() over(partition by
> > > >> > a.PKid order by a.pkid) id,a.remark from  (SELECT '1' PKID,'dfsdd' remark
> > > >> > from origin_data limit 17000000) a").show;
> > > >> >
> > > >> > step 3:
> > > >> > cc.sql("CREATE TABLE IF NOT EXISTS c_indextest1 (id string, remark
> > > >> > string) STORED BY 'carbondata'").show;
> > > >> >
> > > >> > step 4:
> > > >> > cc.sql("insert into c_indextest1 select * from  c_indextest2 where
> > > >> > pmod(cast(ID as int),50)=43").show;
> > > >> >
> > > >> > step 5:
> > > >> > cc.sql("update c_indextest2 b set (b.remark)=(select a.remark from
> > > >> > c_indextest1 a where a.id=b.id)").show;
> > > >> >
> > > >> >
> > > >> > yixu2001
> > > >> >
> > > >> > From: sounak
> > > >> > Date: 2017-10-19 18:26
> > > >> > To: dev
> > > >> > Subject: Re: Re: Update statement failed with "Multiple input rows
> > > >> matched
> > > >> > for same row" in version 1.2.0,
> > > >> > Hi Yixu,
> > > >> >
> > > >> > Thanks for the DDL. But I am particularly interested in the data, in
> > > >> > order to reproduce the problem. Would it be feasible to share the
> > > >> > data or, if it is huge, the scripts that generate it? I cannot access
> > > >> > the data location from the public network.
> > > >> >
> > > >> >
> > > >> > On Thu, Oct 19, 2017 at 3:00 PM, yixu2001 <yi...@163.com> wrote:
> > > >> >
> > > >> >> dev
> > > >> >>
> > > >> >> Step 1: I create a Hive table qqdata2.h_indextest1:
> > > >> >> CREATE EXTERNAL TABLE `qqdata2.h_indextest1`(
> > > >> >> `id` INT,
> > > >> >> `CUST_ORDER_ID` STRING,
> > > >> >> `ORDER_ITEM_IDATTR_ID` STRING,
> > > >> >> `ATTR_VALUE_IDATTR_VALUE` STRING,
> > > >> >> `CREATE_DATE` STRING,
> > > >> >> `UPDATE_DATE` STRING,
> > > >> >> `STATUS_CD` STRING,
> > > >> >> `STATUS_DATE` STRING,
> > > >> >> `AREA_ID` STRING,
> > > >> >> `REGION_CD` STRING,
> > > >> >> `UPDATE_STAFF` STRING,
> > > >> >> `CREATE_STAFF` STRING,
> > > >> >> `SHARDING_ID` STRING,
> > > >> >> `ORDER_ATTR_ID` STRING )
> > > >> >> ROW FORMAT DELIMITED
> > > >> >>   FIELDS TERMINATED BY '\u0001'
> > > >> >>   LINES TERMINATED BY '\n'
> > > >> >>   NULL DEFINED AS ''
> > > >> >> STORED AS INPUTFORMAT
> > > >> >>   'org.apache.hadoop.mapred.TextInputFormat'
> > > >> >> OUTPUTFORMAT
> > > >> >>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> > > >> >> LOCATION
> > > >> >>   'hdfs://hdp78.ffcs.cn:8020/user/bigdata/streamcql/dist1';
> > > >> >>
> > > >> >> There are CSV files in the HDFS dir user/bigdata/streamcql/dist1; the
> > > >> >> data format in the CSV files is as follows:
> > > >> >>
> > > >> >> 1939505130,171483932,287305502,813463930,20160709134396669,201607101469099594,1299,20160711996575390,10,73,302063,302064,127859875,9999999
> > > >> >>
> > > >> >> Step 2: I create a carbon table qqdata2.c_indextest1:
> > > >> >> cc.sql("CREATE TABLE IF NOT EXISTS qqdata2.c_indextest1 (id STRING,
> > > >> >> CUST_ORDER_ID STRING,ORDER_ITEM_IDATTR_ID STRING,ATTR_VALUE_IDATTR_VALUE
> > > >> >> STRING,CREATE_DATE STRING,UPDATE_DATE STRING,STATUS_CD STRING,STATUS_DATE
> > > >> >> STRING,AREA_ID STRING,REGION_CD STRING,UPDATE_STAFF STRING,CREATE_STAFF
> > > >> >> STRING,SHARDING_ID STRING,ORDER_ATTR_ID STRING) STORED BY 'carbondata' ")
> > > >> >>
> > > >> >> Step 3: Insert data:
> > > >> >> cc.sql("insert into qqdata2.c_indextest1 select * from
> > > >> >> qqdata2.h_indextest1").show(100,false);
> > > >> >>
> > > >> >> Step 4: Repeating steps 1 to 3, I create another carbon table,
> > > >> >> qqdata2.c_indextest2.
> > > >> >> The record count of qqdata2.c_indextest1 is 300,000 (30w); the record
> > > >> >> count of qqdata2.c_indextest2 is 17 million (1700w).
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> yixu2001
> > > >> >>
> > > >> >> From: sounak
> > > >> >> Date: 2017-10-19 14:13
> > > >> >> To: dev
> > > >> >> Subject: Re: Update statement failed with "Multiple input rows
> > > matched
> > > >> >> for
> > > >> >> same row" in version 1.2.0,
> > > >> >> Hi Yixu,
> > > >> >>
> > > >> >> Can you please share the DDLs and the Data for the above problem
> > with
> > > >> us?
> > > >> >>
> > > >> >> Thanks
> > > >> >> Sounak
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> --
> > > >> >> Thanks
> > > >> >> Sounak
> > > >> >>
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > Thanks
> > > >> > Sounak
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks
> > > > Sounak
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks
> > > Sounak
> > >
> >
> >
> >
> > --
> > Thanks
> > Sounak
> >
>
 
 
 
-- 
Thanks
Sounak

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by sounak <so...@gmail.com>.
Hi,

This fix is still under development. Hopefully we will be able to give a
proper solution by this weekend.






-- 
Thanks
Sounak

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by yixu2001 <yi...@163.com>.
dev 
When is a patch expected to be released, so that we can test it?


yixu2001
 

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by yixu2001 <yi...@163.com>.
dev 
 Hello, if there is no solution yet, can the next version, carbondata 1.2.1, be released?


yixu2001
 
> > >> >> > Dataset$$execute$1$1.apply(Dataset.scala:2378)
> > >> >> >   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutio
> > >> nId(
> > >> >> > SQLExecution.scala:57)
> > >> >> >   at
> > >> >> org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2780)
> > >> >> >   at org.apache.spark.sql.Dataset.org
> > >> >> >
> > >> >> > $apache$spark$sql$Dataset$$execute$1(Dataset.scala:2377)
> > >> >> >   at org.apache.spark.sql.Dataset.org
> > >> >> >
> > >> >> > $apache$spark$sql$Dataset$$collect(Dataset.scala:2384)
> > >> >> >   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.
> > >> >> > scala:2120)
> > >> >> >   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.
> > >> >> > scala:2119)
> > >> >> >   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala
> > >> :2810)
> > >> >> >   at org.apache.spark.sql.Dataset.head(Dataset.scala:2119)
> > >> >> >   at org.apache.spark.sql.Dataset.take(Dataset.scala:2334)
> > >> >> >   at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
> > >> >> >   at org.apache.spark.sql.Dataset.show(Dataset.scala:640)
> > >> >> >   ... 50 elided
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> > yixu2001
> > >> >> >
> > >> >>
> > >> >>
> > >> >>
> > >> >> --
> > >> >> Thanks
> > >> >> Sounak
> > >> >>
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Thanks
> > >> > Sounak
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5
> > >> .nabble.com/
> > >>
> > >
> > >
> > >
> > > --
> > > Thanks
> > > Sounak
> > >
> >
> >
> >
> > --
> > Thanks
> > Sounak
> >
>
>
>
> --
> Thanks
> Sounak
>

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by yixu2001 <yi...@163.com>.
dev 
 
Thank you.
We hope this will be resolved in a new release, carbondata 1.2.1.
In addition, the mailing-list thread "Is query slowing down due to the fragmentations produced by the many update operations?" describes another problem that is very important to our selection process. We would be grateful if it could also be resolved in carbondata 1.2.1.
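(For reference, a minimal sketch of the usual mitigation for update-induced fragmentation, assuming the ALTER TABLE ... COMPACT and CLEAN FILES commands documented for CarbonData; the table name is simply the one from this thread:)

// Hedged sketch: merge the small delta/segment files produced by repeated
// updates, then remove the stale files left behind after compaction.
cc.sql("ALTER TABLE qqdata2.c_indextest2 COMPACT 'MINOR'").show;
cc.sql("CLEAN FILES FOR TABLE qqdata2.c_indextest2").show;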


yixu2001
 

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by yixu2001 <yi...@163.com>.
dev 

Test data (CSV files):
Link: http://pan.baidu.com/s/1boVAqeF   Password: qtdh
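(A minimal sketch of loading the shared CSV for reproduction, assuming CarbonData's LOAD DATA syntax; the HDFS path, delimiter, and header below are assumptions, not details from this thread:)

// Hypothetical load of the downloaded CSV; adjust the path and options
// to match the actual files from the link above.
cc.sql("LOAD DATA INPATH 'hdfs:///tmp/testdata/c_indextest2.csv' INTO TABLE c_indextest2 OPTIONS('DELIMITER'=',', 'FILEHEADER'='id,remark')").show;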


yixu2001
 

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by Raghunandan S <ca...@gmail.com>.
Should be fine. Do we need to create any login to access it?
On Fri, 20 Oct 2017 at 1:31 PM, yixu2001 <yi...@163.com> wrote:

> Is it OK to share the data via Baidu SkyDrive?

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by yixu2001 <yi...@163.com>.
dev
Is it OK to share the data via Baidu SkyDrive?


yixu2001
 

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by sounak <so...@gmail.com>.
As the tupleIds are also unique, the only option left is to get hold of
your data and reproduce the problem in-house. Is it possible for us to
generate your data in-house, or to obtain it?
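(In case it helps, a minimal sketch of generating equivalent data in-house, assuming a CarbonSession named cc as elsewhere in this thread; spark.range stands in for the origin_data table from the reproduction steps, so the sizes and values are illustrative:)

// Hypothetical in-house data generation; mirrors the reproduction steps
// without needing the original origin_data table.
cc.sql("CREATE TABLE IF NOT EXISTS c_indextest2 (id string, remark string) STORED BY 'carbondata'").show;
cc.sql("CREATE TABLE IF NOT EXISTS c_indextest1 (id string, remark string) STORED BY 'carbondata'").show;
// 17 million unique ids with a constant remark column.
cc.range(1, 17000001)
  .selectExpr("cast(id as string) as id", "'dfsdd' as remark")
  .createOrReplaceTempView("gen_data")
cc.sql("insert into c_indextest2 select id, remark from gen_data").show;
cc.sql("insert into c_indextest1 select * from c_indextest2 where pmod(cast(id as int), 50) = 43").show;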

On Fri, Oct 20, 2017 at 12:14 PM, yixu2001 <yi...@163.com> wrote:

> dev
>
> scala> cc.sql(" select  c.tupleId,count(*)  from (select a.remark,a.id,
> getTupleId() as tupleId from c_indextest1 a,c_indextest2 b where a.id=b.id)
> c group by c.tupleId having count(*)>1").show;
> +-------+--------+
> |tupleId|count(1)|
> +-------+--------+
> +-------+--------+
>
> no Multiple input rows
>
>
> yixu2001
>
-- 
Thanks
Sounak

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by yixu2001 <yi...@163.com>.
dev 
 
scala> cc.sql(" select  c.tupleId,count(*)  from (select a.remark,a.id, getTupleId() as tupleId from c_indextest1 a,c_indextest2 b where a.id=b.id) c group by c.tupleId having count(*)>1").show;
+-------+--------+                                                              
|tupleId|count(1)|
+-------+--------+
+-------+--------+

No multiple input rows were found.


yixu2001
 

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by yixu2001 <yi...@163.com>.
dev 


The a.id values map to b.id values one to one; only the remark column
contains duplicate values (cf. the "Multiple input rows" error):
 scala> cc.sql("select count(a.remark),count(distinct a.remark),count(a.id),count(distinct a.id),count(b.id),count(distinct b.id)  from c_indextest1 a,c_indextest2 b  where a.id=b.id").show;
+-------------+----------------------+---------+------------------+---------+------------------+
|count(remark)|  count(DISTINCT remark)|  count(a.id)  |count(DISTINCT a.id)|count(b.id)          |   count(DISTINCT b.id)|
+-------------+----------------------+---------+------------------+---------+------------------+
|       340000|                     1                         |      340000  |         340000          |       340000        |  340000                    |
+-------------+----------------------+---------+------------------+---------+------------------+

scala> cc.sql("select count(b.id),count(distinct b.id),count(b.remark),count(distinct b.remark) from c_indextest2 b").show;
+---------+------------------+-------------+----------------------+             
|count(id)|count(DISTINCT id)|count(remark)|count(DISTINCT remark)|
+---------+------------------+-------------+----------------------+
| 17000000|          17000000|     17000000|                     1|
+---------+------------------+-------------+----------------------+


scala> cc.sql("select count(b.id),count(distinct b.id),count(b.remark),count(distinct b.remark) from c_indextest1 b").show;
+---------+------------------+-------------+----------------------+             
|count(id)|count(DISTINCT id)|count(remark)|count(DISTINCT remark)|
+---------+------------------+-------------+----------------------+
|   340000|            340000|       340000|                     1|
+---------+------------------+-------------+----------------------+

So a.id maps to b.id one to one, and a.remark is used to update b.remark:

cc.sql("update c_indextest2 b set (b.remark)=(select a.remark from c_indextest1 a where a.id=b.id)").show;
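
Given the counts above, a per-key check of the join multiplicity (a sketch
in the same spark-shell style) should return no rows; any row it did
return would be a b.id matched by more than one a row:

cc.sql("select b.id, count(*) from c_indextest2 b join c_indextest1 a on a.id=b.id group by b.id having count(*)>1").show;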



yixu2001
 

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by sounak <so...@gmail.com>.
A slight correction to a typo in the query syntax:

" select a.remark, a.id, getTupleId() as *tupleId* from c_indextest1 a, c_indextest2 b where a.id=b.id; "

-- 
Thanks
Sounak

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by sounak <so...@gmail.com>.
Internally, we call a UDF to generate the tupleId, and based on that
tupleId we decide whether a row is duplicated or not. You can run a
slightly tweaked query:

" select a.remark, a.id, getTupleId() as TupleId from c_indextest1 a, c_indextest2 b where a.id=b.id; "

For the "Multiple input rows" error to be raised, getTupleId() has most
probably produced duplicate entries.

Meanwhile, your given query runs fine with the data I have generated
in-house, so this problem is data dependent. It would be good if we could
get *origin_data*, or the script which built the data for origin_data.
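
To surface any duplicates directly, the same query can be wrapped in a
grouped check (a sketch, run in the same spark-shell session):

cc.sql(" select c.tupleId, count(*) from (select a.remark, a.id, getTupleId() as tupleId from c_indextest1 a, c_indextest2 b where a.id=b.id) c group by c.tupleId having count(*)>1").show;

Any tupleId returned here is a destination row matched by more than one
input row.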

-- 
Thanks
Sounak

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by yixu2001 <yi...@163.com>.
dev

It returns one row record.


yixu2001
 
From: Liang Chen
Date: 2017-10-19 22:26
To: dev
Subject: Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,
Hi
 
Execute the below query, return one row record or multiple row records ?
---------------------------------
select a.remark from  c_indextest1 a where a.id=b.id 
 
Regards
Liang
 
 

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by yixu2001 <yi...@163.com>.
dev 
Sorry, I made a mistake.

Executing the query below returns multiple row records.

select a.remark from  c_indextest1 a,c_indextest2 b where a.id=b.id
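Since a join returns one row per matching pair of ids, a multi-row result
in total does not by itself prove that any single id matched more than
once. A minimal sketch of a per-id check (table names taken from the repro
steps in this thread; the query itself is illustrative, not from the
original mail):

cc.sql("select b.id, count(*) cnt from c_indextest2 b, c_indextest1 a where a.id=b.id group by b.id having count(*) > 1").show(100,false);

An empty result would confirm that each id in c_indextest2 matches at most
one row in c_indextest1.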


yixu2001
 
From: Liang Chen
Date: 2017-10-19 22:26
To: dev
Subject: Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,
Hi
 
If you execute the query below, does it return one row record or multiple row records?
---------------------------------
select a.remark from  c_indextest1 a where a.id=b.id 
 
Regards
Liang
 
 

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by Liang Chen <ch...@gmail.com>.
Hi

If you execute the query below, does it return one row record or multiple row records?
---------------------------------
select a.remark from  c_indextest1 a where a.id=b.id 
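As written, this query cannot be run on its own because b is the alias of
the update target; a runnable sketch substitutes a concrete id (the value
'43' below is purely illustrative):

cc.sql("select a.remark from c_indextest1 a where a.id = '43'").show(100,false);

A single returned row would mean at most one input row matches that id.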

Regards
Liang



Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by yixu2001 <yi...@163.com>.
dev 
 You can follow the steps below to reproduce the problem.
Table c_indextest2 has about 17 million records and table c_indextest1 has about 300,000 records.

step 1:
cc.sql("CREATE TABLE IF NOT EXISTS c_indextest2 (id string ,remark string)  STORED BY 'carbondata'").show;

step 2: origin_data is an existing table with 17 million records; the fields in the table do not matter.
cc.sql("insert into c_indextest2 select row_number() over(partition by a.PKid order by a.pkid) id,a.remark from  (SELECT '1' PKID,'dfsdd' remark from origin_data limit 17000000) a").show;

step 3:
cc.sql("CREATE TABLE IF NOT EXISTS c_indextest1 (id string ,remark string)  STORED BY 'carbondata'").show;

step 4:
cc.sql("insert into c_indextest1 select * from  c_indextest2 where  pmod(cast(ID as int),50)=43").show;

step 5:
cc.sql("update c_indextest2 b set (b.remark)=(select a.remark from  c_indextest1 a where a.id=b.id)").show;


yixu2001
 
From: sounak
Date: 2017-10-19 18:26
To: dev
Subject: Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,
Hi Yixu,
 
Thanks for the DDL, but I am particularly interested in the data in order
to reproduce the problem. Would it be feasible to share the data or, if it
is huge, the scripts that generate it? I cannot access the data location
from a public network.
 
 

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by sounak <so...@gmail.com>.
Hi Yixu,

Thanks for the DDL, but I am particularly interested in the data in order
to reproduce the problem. Would it be feasible to share the data or, if it
is huge, the scripts that generate it? I cannot access the data location
from a public network.


-- 
Thanks
Sounak

Re: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by yixu2001 <yi...@163.com>.
dev 

Step 1: I create a Hive table qqdata2.h_indextest1:
CREATE EXTERNAL TABLE `qqdata2.h_indextest1`(
`id` INT, 
`CUST_ORDER_ID` STRING,
`ORDER_ITEM_IDATTR_ID` STRING,
`ATTR_VALUE_IDATTR_VALUE` STRING,
`CREATE_DATE` STRING,
`UPDATE_DATE` STRING,
`STATUS_CD` STRING,
`STATUS_DATE` STRING,
`AREA_ID` STRING,
`REGION_CD` STRING,
`UPDATE_STAFF` STRING,
`CREATE_STAFF` STRING,
`SHARDING_ID` STRING,
`ORDER_ATTR_ID` STRING )
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\u0001'
  LINES TERMINATED BY '\n'
  NULL DEFINED AS ''
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hdp78.ffcs.cn:8020/user/bigdata/streamcql/dist1';

There are CSV files in the HDFS directory user/bigdata/streamcql/dist1; the data in the CSV files is formatted as follows:

1939505130,171483932,287305502,813463930,20160709134396669,201607101469099594,1299,20160711996575390,10,73,302063,302064,127859875,9999999

Step 2: I create a carbon table qqdata2.c_indextest1:
cc.sql("CREATE TABLE IF NOT EXISTS qqdata2.c_indextest1 (id STRING, CUST_ORDER_ID STRING,ORDER_ITEM_IDATTR_ID STRING,ATTR_VALUE_IDATTR_VALUE STRING,CREATE_DATE STRING,UPDATE_DATE STRING,STATUS_CD STRING,STATUS_DATE STRING,AREA_ID STRING,REGION_CD STRING,UPDATE_STAFF STRING,CREATE_STAFF STRING,SHARDING_ID STRING,ORDER_ATTR_ID STRING) STORED BY 'carbondata' ")

Step 3: Insert data:
cc.sql("insert into qqdata2.c_indextest1 select * from qqdata2.h_indextest1").show(100,false);

Step 4: Repeating steps 1 to 3, I create another carbon table, qqdata2.c_indextest2.
qqdata2.c_indextest1 has about 300,000 records and qqdata2.c_indextest2 has about 17 million records.
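As a sanity check before running the failing update, the following sketch
(table and column names as created above; the queries themselves are
illustrative additions) verifies that neither table contains duplicate ids:

cc.sql("select id, count(*) cnt from qqdata2.c_indextest1 group by id having count(*) > 1").show(100,false);
cc.sql("select id, count(*) cnt from qqdata2.c_indextest2 group by id having count(*) > 1").show(100,false);

An empty result from both queries confirms that every id is unique.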
 


yixu2001
 
From: sounak
Date: 2017-10-19 14:13
To: dev
Subject: Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,
Hi Yixu,
 
Can you please share the DDLs and the data for the above problem with us?
 
Thanks
Sounak
 

Re: Update statement failed with "Multiple input rows matched for same row" in version 1.2.0,

Posted by sounak <so...@gmail.com>.
Hi Yixu,

Can you please share the DDLs and the data for the above problem with us?

Thanks
Sounak




-- 
Thanks
Sounak