You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2015/05/14 01:51:59 UTC
[jira] [Comment Edited] (DRILL-3062) regression: Mondrian
query447.q - lots of rows missing in result set
[ https://issues.apache.org/jira/browse/DRILL-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542930#comment-14542930 ]
Aman Sinha edited comment on DRILL-3062 at 5/13/15 11:51 PM:
-------------------------------------------------------------
The strings literals in the Values POP in the physical plan have trailing blank spaces which is causing the wrong results. Here's a simple example:
{code}
0: jdbc:drill:zk=local> select cast(n_name as varchar(20)) from cp.`tpch/nation.parquet` where n_name in ('ALGERIA', 'ARGENTINA', 'BRAZIL', 'CANADA', 'EGYPT', 'ETHIOPIA', 'FRANCE', 'GERMANY', 'INDIA', 'INDONESIA', 'IRAN', 'IRAQ', 'JAPAN', 'JORDAN', 'KENYA', 'MOROCCO', 'MOZAMBIQUE', 'PERU', 'CHINA', 'ROMANIA', 'SAUDI ARABIA', 'VIETNAM');
+------------+
| EXPR$0 |
+------------+
| SAUDI ARABIA |
+------------+
{code}
Note that "SAUDI ARABIA" is the longest string.
Here's the plan that shows the blank padding to make them the same length as the longest string.
{code}
}, {
"pop" : "Values",
"@id" : 196611,
"content" : [ {
"ROW_VALUE" : "ALGERIA "
}, {
"ROW_VALUE" : "ARGENTINA "
}, {
"ROW_VALUE" : "BRAZIL "
}, {
"ROW_VALUE" : "CANADA "
}, {
"ROW_VALUE" : "EGYPT "
}, {
"ROW_VALUE" : "ETHIOPIA "
}, {
"ROW_VALUE" : "FRANCE "
}, {
"ROW_VALUE" : "GERMANY "
}, {
"ROW_VALUE" : "INDIA "
}, {
"ROW_VALUE" : "INDONESIA "
}, {
"ROW_VALUE" : "IRAN "
}, {
"ROW_VALUE" : "IRAQ "
}, {
"ROW_VALUE" : "JAPAN "
}, {
"ROW_VALUE" : "JORDAN "
}, {
"ROW_VALUE" : "KENYA "
}, {
"ROW_VALUE" : "MOROCCO "
}, {
"ROW_VALUE" : "MOZAMBIQUE "
}, {
"ROW_VALUE" : "PERU "
}, {
"ROW_VALUE" : "CHINA "
}, {
"ROW_VALUE" : "ROMANIA "
}, {
"ROW_VALUE" : "SAUDI ARABIA"
}, {
"ROW_VALUE" : "VIETNAM "
} ],
"initialAllocation" : 1000000,
"maxAllocation" : 10000000000,
"cost" : 1.0
},
{code}
was (Author: amansinha100):
The strings literals in the Values POP in the physical plan have trailing blank spaces which is causing the wrong results. Here's a simple example:
{code}
0: jdbc:drill:zk=local> select cast(n_name as varchar(20)) from cp.`tpch/nation.parquet` where n_name in ('ALGERIA', 'ARGENTINA', 'BRAZIL', 'CANADA', 'EGYPT', 'ETHIOPIA', 'FRANCE', 'GERMANY', 'INDIA', 'INDONESIA', 'IRAN', 'IRAQ', 'JAPAN', 'JORDAN', 'KENYA', 'MOROCCO', 'MOZAMBIQUE', 'PERU', 'CHINA', 'ROMANIA', 'SAUDI ARABIA', 'VIETNAM');
+------------+
| EXPR$0 |
+------------+
| SAUDI ARABIA |
+------------+
{code}
Note that "SAUDI ARABIA" is the longest string.
Here's the plan that shows the blank padding to make them the same length as the longest string.
{code}
}, {
"pop" : "Values",
"@id" : 196611,
"content" : [ {
"ROW_VALUE" : "ALGERIA "
}, {
"ROW_VALUE" : "ARGENTINA "
}, {
"ROW_VALUE" : "BRAZIL "
}, {
"ROW_VALUE" : "CANADA "
}, {
"ROW_VALUE" : "EGYPT "
}, {
"ROW_VALUE" : "ETHIOPIA "
}, {
"ROW_VALUE" : "FRANCE "
}, {
"ROW_VALUE" : "GERMANY "
}, {
"ROW_VALUE" : "INDIA "
}, {
"ROW_VALUE" : "INDONESIA "
}, {
"ROW_VALUE" : "IRAN "
}, {
"ROW_VALUE" : "IRAQ "
}, {
"ROW_VALUE" : "JAPAN "
}, {
"ROW_VALUE" : "JORDAN "
}, {
"ROW_VALUE" : "KENYA "
}, {
"ROW_VALUE" : "MOROCCO "
}, {
"ROW_VALUE" : "MOZAMBIQUE "
}, {
"ROW_VALUE" : "PERU "
}, {
"ROW_VALUE" : "CHINA "
}, {
"ROW_VALUE" : "ROMANIA "
}, {
"ROW_VALUE" : "SAUDI ARABIA"
}, {
"ROW_VALUE" : "VIETNAM "
} ],
"initialAllocation" : 1000000,
"maxAllocation" : 10000000000,
"cost" : 1.0
},
> regression: Mondrian query447.q - lots of rows missing in result set
> --------------------------------------------------------------------
>
> Key: DRILL-3062
> URL: https://issues.apache.org/jira/browse/DRILL-3062
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Data Types
> Affects Versions: 1.0.0
> Reporter: Chun Chang
> Assignee: Aman Sinha
> Priority: Blocker
> Fix For: 1.0.0
>
>
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select * from sys.version;
> +------------+----------------+-------------+-------------+------------+
> | commit_id | commit_message | commit_time | build_email | build_time |
> +------------+----------------+-------------+-------------+------------+
> | d1526f9462f6817a76631464ff332bb99b3bdf28 | DRILL-2750: Running 1 or more queries against Drillbits having insufficient DirectMem renders the Drillbits in an unusable state | 13.05.2015 @ 08:47:20 EDT | Unknown | 13.05.2015 @ 10:44:43 EDT |
> +------------+----------------+-------------+-------------+------------+
> {code}
> Many (total of 42) mondrian queries regressed. All of them missing rows in the returned result set.
> Here is an example, query447.q
> {code}
> SELECT time_by_day.the_year AS c0,
> product_class.product_family AS c1,
> customer.state_province AS c2,
> customer.city AS c3,
> Sum(sales_fact_1997.unit_sales) AS m0
> FROM time_by_day AS time_by_day,
> sales_fact_1997 AS sales_fact_1997,
> product_class AS product_class,
> product AS product,
> customer AS customer
> WHERE sales_fact_1997.time_id = time_by_day.time_id
> AND time_by_day.the_year = 1997
> AND sales_fact_1997.product_id = product.product_id
> AND product.product_class_id = product_class.product_class_id
> AND product_class.product_family = 'Drink'
> AND sales_fact_1997.customer_id = customer.customer_id
> AND customer.state_province = 'WA'
> AND customer.city IN ( 'Anacortes', 'Ballard', 'Bellingham', 'Bremerton',
> 'Burien', 'Edmonds', 'Everett', 'Issaquah',
> 'Kirkland', 'Lynnwood', 'Marysville', 'Olympia',
> 'Port Orchard', 'Puyallup', 'Redmond', 'Renton',
> 'Seattle', 'Sedro Woolley', 'Spokane', 'Tacoma',
> 'Walla Walla', 'Yakima' )
> GROUP BY time_by_day.the_year,
> product_class.product_family,
> customer.state_province,
> customer.city;
> {code}
> This query should return the following result:
> {code}
> [root@qa-node120 mondrian]# cat query447.e
> 1997 Drink WA Walla Walla 191.0000
> 1997 Drink WA Issaquah 203.0000
> 1997 Drink WA Everett 208.0000
> 1997 Drink WA Olympia 1066.0000
> 1997 Drink WA Edmonds 166.0000
> 1997 Drink WA Bremerton 1160.0000
> 1997 Drink WA Renton 225.0000
> 1997 Drink WA Bellingham 68.0000
> 1997 Drink WA Ballard 214.0000
> 1997 Drink WA Burien 251.0000
> 1997 Drink WA Seattle 168.0000
> 1997 Drink WA Redmond 137.0000
> 1997 Drink WA Lynnwood 201.0000
> 1997 Drink WA Puyallup 1040.0000
> 1997 Drink WA Tacoma 986.0000
> 1997 Drink WA Kirkland 247.0000
> 1997 Drink WA Sedro Woolley 58.0000
> 1997 Drink WA Yakima 1159.0000
> 1997 Drink WA Port Orchard 1128.0000
> 1997 Drink WA Spokane 2238.0000
> 1997 Drink WA Anacortes 82.0000
> 1997 Drink WA Marysville 193.0000
> {code}
> But drill now returns:
> {code}
> 1997 Drink WA Sedro Woolley 58.0000
> {code}
> Here is the plan:
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select time_by_day.the_year as c0, product_class.product_family as c1, customer.state_province as c2, customer.city as c3, sum(sales_fact_1997.unit_sales) as m0 from time_by_day as time_by_day, sales_fact_1997 as sales_fact_1997, product_class as product_class, product as product, customer as customer where sales_fact_1997.time_id = time_by_day.time_id and time_by_day.the_year = 1997 and sales_fact_1997.product_id = product.product_id and product.product_class_id = product_class.product_class_id and product_class.product_family = 'Drink' and sales_fact_1997.customer_id = customer.customer_id and customer.state_province = 'WA' and customer.city in ('Anacortes', 'Ballard', 'Bellingham', 'Bremerton', 'Burien', 'Edmonds', 'Everett', 'Issaquah', 'Kirkland', 'Lynnwood', 'Marysville', 'Olympia', 'Port Orchard', 'Puyallup', 'Redmond', 'Renton', 'Seattle', 'Sedro Woolley', 'Spokane', 'Tacoma', 'Walla Walla', 'Yakima') group by time_by_day.the_year, product_class.product_family, customer.state_province, customer.city;
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00 Screen
> 00-01 Project(c0=[$0], c1=[$1], c2=[$2], c3=[$3], m0=[$4])
> 00-02 HashAgg(group=[{0, 1, 2, 3}], m0=[SUM($4)])
> 00-03 Project(c0=[$0], c1=[$2], c2=[$3], c3=[$4], unit_sales=[$1])
> 00-04 HashJoin(condition=[=($5, $6)], joinType=[inner])
> 00-06 Project(the_year=[$0], unit_sales=[$5], product_family=[$6], state_province=[$10], city=[$11], f17=[$11])
> 00-08 Project(the_year=[$4], time_id=[$5], time_id0=[$0], product_id=[$1], customer_id=[$2], unit_sales=[$3], product_family=[$8], product_class_id=[$9], product_id0=[$6], product_class_id0=[$7], state_province=[$10], city=[$11], customer_id0=[$12])
> 00-09 HashJoin(condition=[=($2, $12)], joinType=[inner])
> 00-11 HashJoin(condition=[=($1, $6)], joinType=[inner])
> 00-14 HashJoin(condition=[=($0, $5)], joinType=[inner])
> 00-18 Project(time_id=[$2], product_id=[$0], customer_id=[$1], unit_sales=[$3])
> 00-23 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/mondrian/sales_fact_1997]], selectionRoot=/drill/testdata/mondrian/sales_fact_1997, numFiles=1, columns=[`time_id`, `product_id`, `customer_id`, `unit_sales`]]])
> 00-17 Project(the_year=[$0], time_id0=[$1])
> 00-22 SelectionVectorRemover
> 00-26 Filter(condition=[=($0, 1997)])
> 00-28 Project(the_year=[$1], time_id=[$0])
> 00-30 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/mondrian/time_by_day]], selectionRoot=/drill/testdata/mondrian/time_by_day, numFiles=1, columns=[`the_year`, `time_id`]]])
> 00-13 Project(product_id0=[$0], product_class_id=[$1], product_family=[$2], product_class_id0=[$3])
> 00-16 HashJoin(condition=[=($1, $3)], joinType=[inner])
> 00-21 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/mondrian/product]], selectionRoot=/drill/testdata/mondrian/product, numFiles=1, columns=[`product_id`, `product_class_id`]]])
> 00-20 Project(product_family=[$0], product_class_id0=[$1])
> 00-25 SelectionVectorRemover
> 00-27 Filter(condition=[=($0, 'Drink')])
> 00-29 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/mondrian/product_class]], selectionRoot=/drill/testdata/mondrian/product_class, numFiles=1, columns=[`product_family`, `product_class_id`]]])
> 00-10 Project(state_province=[$0], city=[$1], customer_id0=[$2])
> 00-12 SelectionVectorRemover
> 00-15 Filter(condition=[=($0, 'WA')])
> 00-19 Project(state_province=[$1], city=[$2], customer_id=[$0])
> 00-24 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/mondrian/customer]], selectionRoot=/drill/testdata/mondrian/customer, numFiles=1, columns=[`state_province`, `city`, `customer_id`]]])
> 00-05 HashAgg(group=[{0}])
> 00-07 Values
> | {
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)