Posted to issues@drill.apache.org by "John Omernik (JIRA)" <ji...@apache.org> on 2015/12/01 14:59:10 UTC
[jira] [Commented] (DRILL-4145) IndexOutOfBoundsException raised during select * query on S3 csv file

    [ https://issues.apache.org/jira/browse/DRILL-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033713#comment-15033713 ] 

John Omernik commented on DRILL-4145:
-------------------------------------

I just tested apps1-bad.csv on a MapRFS-based DFS plugin, and when I ran the same query as you I had no issues at all. (So perhaps we can focus on S3 here.) I am running the Developer release of MapR Drill, which is based on the Apache 1.3 release, so other than some additions for MapR Tables we should be on the same code base (if you are running 1.3).

This is interesting to me, though, because when I ran the query, instead of splitting the header into named fields as your setup did, mine returned a single field named "columns" containing an array. So the output of my "limit 1" query started out like this:

| ["FIELD_1","FIELD_2","FIELD_3","....

That is, your query parsed the header row into field names, whereas mine returned the whole row as an array. I bring this up because I am curious about the differences in our setups. If we are both running 1.3, we should get the same result, right? Can you share the formats section of your S3 plugin? I tried setting "extractHeader": true on mine but got the same result, so I am curious about your configuration there.
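
For comparison, here is a rough sketch of the query shape I would use on my side, where the header is not extracted and each row comes back as a single "columns" array. The dfs.root workspace and the aliases are placeholders from my own setup (assumptions, not your configuration); the file path is copied from your query:

{noformat}
-- Assumes header extraction is NOT in effect, so each row is one "columns" array.
-- dfs.root is a placeholder workspace name; substitute your own plugin/workspace.
SELECT columns[0] AS field_1,
       columns[1] AS field_2
FROM dfs.root.`staging/data/apps1-bad.csv`
LIMIT 1;
{noformat}

If this form works on your cluster while select * fails, that would suggest the header handling is involved.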

I want to get to the point where we can home in on the S3 difference and eliminate configuration or version differences as causes.

Additionally, can you run select * from sys.version and share the commit_time and build_time from yours? That may be helpful for me as well. I have a commit time of 20.11.2015 @ 01:34:54 UTC and a build time of 21.11.2015 @ 05:21:04 UTC. Are you using the official release or a snapshot from GitHub?
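
If it is easier, a narrower form of that query returns just the fields I am interested in. This is only a sketch; I am assuming the column names below match what sys.version exposes on your build:

{noformat}
-- Pull only the build-identifying columns from the sys.version system table.
SELECT version, commit_id, commit_time, build_time
FROM sys.version;
{noformat}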

Thanks!


> IndexOutOfBoundsException raised during select * query on S3 csv file
> ---------------------------------------------------------------------
>
>                 Key: DRILL-4145
>                 URL: https://issues.apache.org/jira/browse/DRILL-4145
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.3.0
>         Environment: Drill 1.3.0 on a 3 node distributed-mode cluster on AWS.
> Data files on S3.
> S3 storage plugin configuration:
> {
>   "type": "file",
>   "enabled": true,
>   "connection": "s3a://<bucket-name-was-here>",
>   "workspaces": {
>     "root": {
>       "location": "/",
>       "writable": false,
>       "defaultInputFormat": null
>     },
>     "views": {
>       "location": "/processed",
>       "writable": true,
>       "defaultInputFormat": null
>     },
>     "tmp": {
>       "location": "/tmp",
>       "writable": true,
>       "defaultInputFormat": null
>     }
>   },
>   "formats": {
>     "psv": {
>       "type": "text",
>       "extensions": [
>         "tbl"
>       ],
>       "delimiter": "|"
>     },
>     "csv": {
>       "type": "text",
>       "extensions": [
>         "csv"
>       ],
>       "extractHeader": true,
>       "delimiter": ","
>     },
>     "tsv": {
>       "type": "text",
>       "extensions": [
>         "tsv"
>       ],
>       "delimiter": "\t"
>     },
>     "parquet": {
>       "type": "parquet"
>     },
>     "json": {
>       "type": "json"
>     },
>     "avro": {
>       "type": "avro"
>     },
>     "sequencefile": {
>       "type": "sequencefile",
>       "extensions": [
>         "seq"
>       ]
>     },
>     "csvh": {
>       "type": "text",
>       "extensions": [
>         "csvh",
>         "csv"
>       ],
>       "extractHeader": true,
>       "delimiter": ","
>     }
>   }
> }
>            Reporter: Peter McTaggart
>         Attachments: apps1-bad.csv, apps1.csv
>
>
> When trying to query a .csv file (via sqlline or the Web UI), I am getting an IndexOutOfBoundsException:
> {noformat} 0: jdbc:drill:> select * from s3data.root.`staging/data/apps1-bad.csv` limit 1;
> Error: SYSTEM ERROR: IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))
> Fragment 0:0
> [Error Id: be9856d2-0b80-4b9c-94a4-a1ca38ec5db0 on ip-XXXXX.compute.internal:31010] (state=,code=0)
> 0: jdbc:drill:> select * from s3data.root.`staging/data/apps1.csv` limit 1;
> +----------+----------------------+----------+----------+----------+------------+----------+------------+----------+--------------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
> | FIELD_1  |       FIELD_2        | FIELD_3  | FIELD_4  | FIELD_5  |  FIELD_6   | FIELD_7  |  FIELD_8   | FIELD_9  |   FIELD_10   | FIELD_11  |       FIELD_12       | FIELD_13  | FIELD_14  | FIELD_15  | FIELD_16  | FIELD_17  | FIELD_18  | FIELD_19  |       FIELD_20       | FIELD_21  | FIELD_22  | FIELD_23  | FIELD_24  | FIELD_25  | FIELD_26  | FIELD_27  | FIELD_28  | FIELD_29  | FIELD_30  | FIELD_31  | FIELD_32  | FIELD_33  | FIELD_34  | FIELD_35  |
> +----------+----------------------+----------+----------+----------+------------+----------+------------+----------+--------------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
> | 489517   | 27/10/2015 02:05:27  | 261      | 1130232  | 0        | 925630488  | 0        | 925630488  | -1       | 19531580547  | 00000000  | 27/10/2015 02:00:00  |           | 30        | 300       | 0         | 0         | 00000000  | 00000000  | 27/10/2015 02:05:27  | 0         | 1         | 0         | 35.0      |           |           |           | 505       | 872.0     |           | aBc       |           |           |           |           |
> +----------+----------------------+----------+----------+----------+------------+----------+------------+----------+--------------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
> 1 row selected (1.094 seconds)
> 0: jdbc:drill:>  {noformat}
> Both files are attached: the good file, apps1.csv, and the bad file, apps1-bad.csv.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)