Posted to issues@drill.apache.org by "Jack Crawford (JIRA)" <ji...@apache.org> on 2015/04/02 05:34:53 UTC
[jira] [Commented] (DRILL-2616) strings loaded incorrectly from parquet files
[ https://issues.apache.org/jira/browse/DRILL-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392076#comment-14392076 ]
Jack Crawford commented on DRILL-2616:
--------------------------------------
When I query through Drill, certain strings from some rows are repeated far more often than they appear in the original data. An example query for the first 5 rows shows this in the 'indicator' column. Looking further through the select * output, the id column shows it as well: Drill returns only about 3 unique ids, but the actual data source has many more.
query:
select * from dfs.`indicators.parquet` limit 5;
+------------+------------+------------+------------+
| id | timeNanos | indicator | value |
+------------+------------+------------+------------+
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555457827764000 | distNear | -0.0 |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555457827764000 | distNear | -4.0612379933691045E-4 |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555458137319000 | distNear | -0.0 |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555458137319000 | distNear | -2.6080420511220836E-4 |
| generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7 | 1427555461205550000 | distNear | -0.0 |
+------------+------------+------------+------------+
expected output (verified by loading in spark):
id timeNanos indicator value
generated-4458776b-4e22-415e-8fd9-29b687f40dce 1427555457827764000 distNear -0.000000
generated-4458776b-4e22-415e-8fd9-29b687f40dce 1427555457827764000 smartDiff -0.000406
generated-4458776b-4e22-415e-8fd9-29b687f40dce 1427555458137319000 distNear -0.000000
generated-4458776b-4e22-415e-8fd9-29b687f40dce 1427555458137319000 smartDiff -0.000261
generated-4458776b-4e22-415e-8fd9-29b687f40dce 1427555461205550000 distNear -0.000000
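To make the discrepancy explicit, the two outputs above can be compared column by column. The following is only a rough sketch: the rows are copied by hand from the Drill and Spark outputs, and the helper name `mismatched_columns` and the 1e-6 tolerance (chosen because the Spark values are printed rounded to six decimals) are assumptions for illustration, not part of Drill or Spark.

```python
import math

COLUMNS = ("id", "timeNanos", "indicator", "value")

# Rows copied by hand from the Drill output above.
drill_rows = [
    ("generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7", 1427555457827764000, "distNear", -0.0),
    ("generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7", 1427555457827764000, "distNear", -4.0612379933691045E-4),
    ("generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7", 1427555458137319000, "distNear", -0.0),
    ("generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7", 1427555458137319000, "distNear", -2.6080420511220836E-4),
    ("generated-f61b58e2-a9d1-43a8-b164-2e292a92dbe7", 1427555461205550000, "distNear", -0.0),
]

# Rows copied by hand from the expected (Spark) output above.
spark_rows = [
    ("generated-4458776b-4e22-415e-8fd9-29b687f40dce", 1427555457827764000, "distNear", -0.000000),
    ("generated-4458776b-4e22-415e-8fd9-29b687f40dce", 1427555457827764000, "smartDiff", -0.000406),
    ("generated-4458776b-4e22-415e-8fd9-29b687f40dce", 1427555458137319000, "distNear", -0.000000),
    ("generated-4458776b-4e22-415e-8fd9-29b687f40dce", 1427555458137319000, "smartDiff", -0.000261),
    ("generated-4458776b-4e22-415e-8fd9-29b687f40dce", 1427555461205550000, "distNear", -0.000000),
]


def mismatched_columns(actual, expected, abs_tol=1e-6):
    """Map each column name to the row indexes where actual and expected differ."""
    diffs = {}
    for i, (a_row, e_row) in enumerate(zip(actual, expected)):
        for name, a, e in zip(COLUMNS, a_row, e_row):
            if isinstance(a, float) and isinstance(e, float):
                # Spark's printout is rounded, so compare floats with a tolerance.
                same = math.isclose(a, e, abs_tol=abs_tol)
            else:
                same = a == e
            if not same:
                diffs.setdefault(name, []).append(i)
    return diffs


result = mismatched_columns(drill_rows, spark_rows)
print(result)
```

For these five rows only the two string columns (id and indicator) are reported as differing; timeNanos and value agree within the rounding tolerance, which is consistent with the problem being limited to string columns.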
> strings loaded incorrectly from parquet files
> ---------------------------------------------
>
> Key: DRILL-2616
> URL: https://issues.apache.org/jira/browse/DRILL-2616
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Jack Crawford
> Assignee: Jason Altekruse
> Priority: Critical
> Labels: parquet
>
> When loading string columns from parquet data sources, some rows have their string values replaced with the value from other rows.
> Example parquet for which the problem occurs:
> https://drive.google.com/file/d/0B2JGBdceNMxdeFlJcW1FUElOdXc/view?usp=sharing