Posted to issues@carbondata.apache.org by "Sangeeta Gulia (JIRA)" <ji...@apache.org> on 2018/02/01 07:37:00 UTC

[jira] [Created] (CARBONDATA-2112) Data getting garbled after datamap creation when table is created with GLOBAL SORT

Sangeeta Gulia created CARBONDATA-2112:
------------------------------------------

             Summary: Data getting garbled after datamap creation when table is created with GLOBAL SORT
                 Key: CARBONDATA-2112
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-2112
             Project: CarbonData
          Issue Type: Bug
          Components: data-query
         Environment: spark-2.1
            Reporter: Sangeeta Gulia
         Attachments: 2000_UniqData.csv

Data gets garbled after a pre-aggregate datamap is created on a table whose SORT_SCOPE is BATCH_SORT/GLOBAL_SORT.

Steps to reproduce:

spark.sql("drop table if exists uniqdata_batchsort_compact3")

spark.sql("CREATE TABLE uniqdata_batchsort_compact3 (CUST_ID int,CUST_NAME String,ACTIVE_EMUI_VERSION string, DOB timestamp, DOJ timestamp, BIGINT_COLUMN1 bigint,BIGINT_COLUMN2 bigint,DECIMAL_COLUMN1 decimal(30,10), DECIMAL_COLUMN2 decimal(36,10),Double_COLUMN1 double, Double_COLUMN2 double,INTEGER_COLUMN1 int) STORED BY 'carbondata' TBLPROPERTIES('SORT_SCOPE'='GLOBAL_SORT')").show()
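
(Sanity-check sketch, not part of the original reproduction steps.) The sort scope actually applied to the table can be confirmed before loading; the table-detail output is expected to list the SORT_SCOPE as GLOBAL_SORT:

spark.sql("describe formatted uniqdata_batchsort_compact3").show(100, false)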

spark.sql("LOAD DATA INPATH '/home/sangeeta/Desktop/2000_UniqData.csv' into table " +
 "uniqdata_batchsort_compact3 OPTIONS('DELIMITER'=',' , 'QUOTECHAR'='\"'," +
 "'BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='CUST_ID,CUST_NAME,ACTIVE_EMUI_VERSION," +
 "DOB,DOJ,BIGINT_COLUMN1,BIGINT_COLUMN2,DECIMAL_COLUMN1,DECIMAL_COLUMN2," +
 "Double_COLUMN1,Double_COLUMN2,INTEGER_COLUMN1','batch_sort_size_inmb'='1')")

spark.sql("LOAD DATA INPATH '/home/sangeeta/Desktop/2000_UniqData.csv' into table " +
 "uniqdata_batchsort_compact3 OPTIONS('DELIMITER'=',' , 'QUOTECHAR'='\"'," +
 "'BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='CUST_ID,CUST_NAME,ACTIVE_EMUI_VERSION," +
 "DOB,DOJ,BIGINT_COLUMN1,BIGINT_COLUMN2,DECIMAL_COLUMN1,DECIMAL_COLUMN2," +
 "Double_COLUMN1,Double_COLUMN2,INTEGER_COLUMN1','batch_sort_size_inmb'='1')")

spark.sql("LOAD DATA INPATH '/home/sangeeta/Desktop/2000_UniqData.csv' into table " +
 "uniqdata_batchsort_compact3 OPTIONS('DELIMITER'=',' , 'QUOTECHAR'='\"'," +
 "'BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='CUST_ID,CUST_NAME,ACTIVE_EMUI_VERSION," +
 "DOB,DOJ,BIGINT_COLUMN1,BIGINT_COLUMN2,DECIMAL_COLUMN1,DECIMAL_COLUMN2," +
 "Double_COLUMN1,Double_COLUMN2,INTEGER_COLUMN1','batch_sort_size_inmb'='1')")

spark.sql("select cust_id, avg(cust_id) from uniqdata_batchsort_compact3 group by cust_id ").show(50)

+-------+------------+
|cust_id|avg(cust_id)|
+-------+------------+
|   9376|      9376.0|
|   9427|      9427.0|
|   9465|      9465.0|
|   9852|      9852.0|
|   9900|      9900.0|
|  10206|     10206.0|
|  10362|     10362.0|
|  10623|     10623.0|
|  10817|     10817.0|
|   9182|      9182.0|
|   9564|      9564.0|
|   9879|      9879.0|
|  10081|     10081.0|
|  10121|     10121.0|
|  10230|     10230.0|
|  10462|     10462.0|
|  10703|     10703.0|
|  10914|     10914.0|
|   9162|      9162.0|
|   9383|      9383.0|
|   9454|      9454.0|
|   9517|      9517.0|
|   9558|      9558.0|
|  10708|     10708.0|
|  10798|     10798.0|
|  10862|     10862.0|
|   9071|      9071.0|
|   9169|      9169.0|
|   9946|      9946.0|
|  10468|     10468.0|
|  10745|     10745.0|
|  10768|     10768.0|
|   9153|      9153.0|
|   9206|      9206.0|
|   9403|      9403.0|
|   9597|      9597.0|
|   9647|      9647.0|
|   9775|      9775.0|
|  10032|     10032.0|
|  10395|     10395.0|
|  10527|     10527.0|
|  10567|     10567.0|
|  10632|     10632.0|
|  10788|     10788.0|
|  10815|     10815.0|
|  10840|     10840.0|
|   9181|      9181.0|
|   9344|      9344.0|
|   9575|      9575.0|
|   9675|      9675.0|
+-------+------------+
only showing top 50 rows

Note: Here cust_id is correct.
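
As an additional check (not in the original report), the total row count after the three loads can be verified; assuming the attached 2000_UniqData.csv contains 2000 rows, this should return 6000 rows, and the distinct cust_id count should be the same as for a single load:

spark.sql("select count(*), count(distinct cust_id) from uniqdata_batchsort_compact3").show()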


spark.sql("create datamap uniqdata_agg on table uniqdata_batchsort_compact3 using " +
 "'preaggregate' as select avg(cust_id) from uniqdata_batchsort_compact3 group by cust_id")

spark.sql("select cust_id, avg(cust_id) from uniqdata_batchsort_compact3 group by cust_id ").show(50)

+-------+------------+
|cust_id|avg(cust_id)|
+-------+------------+
|  27651|      9217.0|
|  31944|     10648.0|
|  32667|     10889.0|
|  28242|      9414.0|
|  29841|      9947.0|
|  28728|      9576.0|
|  27255|      9085.0|
|  32571|     10857.0|
|  30276|     10092.0|
|  27276|      9092.0|
|  31503|     10501.0|
|  27687|      9229.0|
|  27183|      9061.0|
|  29334|      9778.0|
|  29913|      9971.0|
|  28683|      9561.0|
|  31545|     10515.0|
|  30405|     10135.0|
|  27693|      9231.0|
|  29649|      9883.0|
|  30537|     10179.0|
|  32709|     10903.0|
|  29586|      9862.0|
|  32895|     10965.0|
|  32415|     10805.0|
|  31644|     10548.0|
|  30030|     10010.0|
|  31713|     10571.0|
|  28083|      9361.0|
|  27813|      9271.0|
|  27171|      9057.0|
|  27189|      9063.0|
|  30444|     10148.0|
|  28623|      9541.0|
|  28566|      9522.0|
|  32655|     10885.0|
|  31164|     10388.0|
|  30321|     10107.0|
|  31452|     10484.0|
|  29829|      9943.0|
|  27468|      9156.0|
|  31212|     10404.0|
|  32154|     10718.0|
|  27531|      9177.0|
|  27654|      9218.0|
|  27105|      9035.0|
|  31113|     10371.0|
|  28479|      9493.0|
|  29094|      9698.0|
|  31551|     10517.0|
+-------+------------+
only showing top 50 rows

Note: After the datamap is created, cust_id is incorrect: each value is three times its original (one factor per load, e.g. 27651 = 3 x 9217), while avg(cust_id) is still correct.
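
Two follow-up checks that may help narrow this down (sketches, not part of the original report; the names and syntax below are assumptions). Querying the pre-aggregate child table directly shows what the datamap actually stores; the table name assumes CarbonData's usual <parent_table>_<datamap_name> naming for pre-aggregate child tables:

spark.sql("select * from uniqdata_batchsort_compact3_uniqdata_agg").show(50, false)

And, assuming the DROP DATAMAP syntax available in this CarbonData version, dropping the datamap and re-running the same query should bring cust_id back to the correct values if the pre-aggregate query rewrite is responsible:

spark.sql("drop datamap if exists uniqdata_agg on table uniqdata_batchsort_compact3")
spark.sql("select cust_id, avg(cust_id) from uniqdata_batchsort_compact3 group by cust_id").show(50)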


