Posted to issues@drill.apache.org by "Khurram Faraaz (JIRA)" <ji...@apache.org> on 2016/09/22 04:53:20 UTC
[jira] [Updated] (DRILL-4898) wrong results : Query on directory containing CSV data
[ https://issues.apache.org/jira/browse/DRILL-4898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Khurram Faraaz updated DRILL-4898:
----------------------------------
Description:
Incorrect results: query on a directory containing CSV data.
The directory holds 6 CSV files with 4534327 rows in total (~4.5M records).
Drill 1.9.0 commit ID: f3c26e34
I can share the data to reproduce the issue.
Note that data in columns[3] has the value "B02512\r" in query results.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` limit 5;
+----------------------------------------------------------+
| columns |
+----------------------------------------------------------+
| ["2014-08-01 00:03:00","40.7366","-73.9906","B02512\r"] |
| ["2014-08-01 00:09:00","40.726","-73.9918","B02512\r"] |
| ["2014-08-01 00:12:00","40.7209","-74.0507","B02512\r"] |
| ["2014-08-01 00:12:00","40.7387","-73.9856","B02512\r"] |
| ["2014-08-01 00:12:00","40.7323","-74.0077","B02512\r"] |
+----------------------------------------------------------+
5 rows selected (0.184 seconds)
{noformat}
But when we do a select on columns[3] we see a different value in the query result.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select columns[3] from `uber_trip_data` limit 5;
+----------+
| EXPR$0 |
+----------+
|02512
|02512
|02512
|02512
|02512
+----------+
5 rows selected (0.159 seconds)
{noformat}
Searching for 'B02512' returns no rows, whereas it should have returned data.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` where columns[3]='B02512';
+----------+
| columns |
+----------+
+----------+
No rows selected (1.707 seconds)
{noformat}
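One possible reading of the three outputs above, offered as an assumption rather than a confirmed diagnosis: if the CSV files use "\r\n" line endings and records are split on "\n" only, the "\r" stays attached to the last field. The Python sketch below is not Drill code. The helper names split_record and render are hypothetical, and the padded row layout passed to render is an assumed approximation of how sqlline pads a cell. It illustrates how a trailing "\r" could produce both the odd-looking "|02512" rows and the empty result of the equality filter.

```python
# Assumption: a raw CSV line as it might appear in a file with Windows line endings.
raw_line = "2014-08-01 00:03:00,40.7366,-73.9906,B02512\r\n"


def split_record(line: str, line_delimiter: str = "\n") -> list[str]:
    """Drop the line delimiter, then split on commas (no quoting support)."""
    if line.endswith(line_delimiter):
        line = line[: -len(line_delimiter)]
    return line.split(",")


def render(text: str) -> str:
    """Simulate a terminal drawing one line: a carriage return moves the cursor to column 0."""
    cells: list[str] = []
    cursor = 0
    for ch in text:
        if ch == "\r":
            cursor = 0
            continue
        if cursor < len(cells):
            cells[cursor] = ch  # overwrite what was already drawn
        else:
            cells.append(ch)
        cursor += 1
    return "".join(cells)


# Splitting on "\n" only leaves the carriage return inside the last field.
base = split_record(raw_line)[3]
print(repr(base))

# Equality against the clean literal fails until the "\r" is removed.
print(base == "B02512")
print(base.rstrip("\r") == "B02512")

# Assumed cell layout: "| " + value padded to 8 characters + " |".
row = "| " + base.ljust(8) + " |"
print(repr(render(row)))

# Splitting on "\r\n" yields the clean value.
clean = split_record(raw_line, line_delimiter="\r\n")[3]
print(clean == "B02512")
```

If this reading holds, the filter in the last query compares 'B02512' against a seven-character value ending in a carriage return, so no row matches, and the padding written after the carriage return overwrites the leading "| B" of each displayed row.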
was:
Incorrect results: query on a directory containing CSV data.
The directory holds 6 CSV files with 4534327 rows in total (~4.5M records).
Drill 1.9.0 commit ID: f3c26e34
Data is available here - /home/MAPRTECH/qa/drill/uber_trip_data
Note that data in columns[3] has the value "B02512\r" in query results.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` limit 5;
+----------------------------------------------------------+
| columns |
+----------------------------------------------------------+
| ["2014-08-01 00:03:00","40.7366","-73.9906","B02512\r"] |
| ["2014-08-01 00:09:00","40.726","-73.9918","B02512\r"] |
| ["2014-08-01 00:12:00","40.7209","-74.0507","B02512\r"] |
| ["2014-08-01 00:12:00","40.7387","-73.9856","B02512\r"] |
| ["2014-08-01 00:12:00","40.7323","-74.0077","B02512\r"] |
+----------------------------------------------------------+
5 rows selected (0.184 seconds)
{noformat}
But when we do a select on columns[3] we see a different value in the query result.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select columns[3] from `uber_trip_data` limit 5;
+----------+
| EXPR$0 |
+----------+
|02512
|02512
|02512
|02512
|02512
+----------+
5 rows selected (0.159 seconds)
{noformat}
Searching for 'B02512' returns no rows, whereas it should have returned data.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `uber_trip_data` where columns[3]='B02512';
+----------+
| columns |
+----------+
+----------+
No rows selected (1.707 seconds)
{noformat}
> wrong results : Query on directory containing CSV data
> ------------------------------------------------------
>
> Key: DRILL-4898
> URL: https://issues.apache.org/jira/browse/DRILL-4898
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow
> Affects Versions: 1.9.0
> Environment: 4 node cluster
> Reporter: Khurram Faraaz