You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Khurram Faraaz (JIRA)" <ji...@apache.org> on 2015/04/22 01:16:59 UTC

[jira] [Commented] (DRILL-2760) Quoted strings from CSV file appear in query output in different forms

    [ https://issues.apache.org/jira/browse/DRILL-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505995#comment-14505995 ] 

Khurram Faraaz commented on DRILL-2760:
---------------------------------------

{code}

Single quotes used in the beginning and end of each string inside a CSV file are returned as part of the result, these must be stripped from the results by the query.

0: jdbc:drill:> select * from `singleQuote.csv`;
+------------+
|  columns   |
+------------+
| ["'2342342'","'this is a test string1'","'12112.297'","'this is cool!'","'@#$%^&*()'","'test@testmail.com'","'    string with spaces    '"] |
| ["'3042672'","'this is a test string2'","'82212.598'","'this is cool!'","'@#$%^&*()'","'test@testmail.com'","'    string with spaces    '"] |
| ["'9142082'","'this is a test string3'","'92612.891'","'this is cool!'","'@#$%^&*()'","'test@testmail.com'","'    string with spaces    '"] |
| ["'1942442'","'this is a test string4'","'32812.294'","'this is cool!'","'@#$%^&*()'","'test@testmail.com'","'    string with spaces    '"] |
+------------+
4 rows selected (0.26 seconds)

0: jdbc:drill:> select columns[0], columns[1], columns[2], columns[3], columns[4], columns[5], columns[6] from `singleQuote.csv`;
+------------+------------+------------+------------+------------+------------+------------+
|   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |   EXPR$5   |   EXPR$6   |
+------------+------------+------------+------------+------------+------------+------------+
| '2342342'  | 'this is a test string1' | '12112.297' | 'this is cool!' | '@#$%^&*()' | 'test@testmail.com' | '    string with spaces    ' |
| '3042672'  | 'this is a test string2' | '82212.598' | 'this is cool!' | '@#$%^&*()' | 'test@testmail.com' | '    string with spaces    ' |
| '9142082'  | 'this is a test string3' | '92612.891' | 'this is cool!' | '@#$%^&*()' | 'test@testmail.com' | '    string with spaces    ' |
| '1942442'  | 'this is a test string4' | '32812.294' | 'this is cool!' | '@#$%^&*()' | 'test@testmail.com' | '    string with spaces    ' |
+------------+------------+------------+------------+------------+------------+------------+
4 rows selected (0.324 seconds)

data from CSV file

[root@centos-01 CSV_delim_quotes]# hadoop fs -cat /tmp/singleQuote.csv
'2342342','this is a test string1','12112.297','this is cool!','@#$%^&*()','test@testmail.com','    string with spaces    '
'3042672','this is a test string2','82212.598','this is cool!','@#$%^&*()','test@testmail.com','    string with spaces    '
'9142082','this is a test string3','92612.891','this is cool!','@#$%^&*()','test@testmail.com','    string with spaces    '
'1942442','this is a test string4','32812.294','this is cool!','@#$%^&*()','test@testmail.com','    string with spaces    '

{code}

> Quoted strings from CSV file appear in query output in different forms
> ----------------------------------------------------------------------
>
>                 Key: DRILL-2760
>                 URL: https://issues.apache.org/jira/browse/DRILL-2760
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 0.9.0
>         Environment: | 9d92b8e319f2d46e8659d903d355450e15946533 | DRILL-2580: Exit early from HashJoinBatch if build side is empty | 26.03.2015 @ 16:13:53 EDT
> 4 node cluster on CentOS
>            Reporter: Khurram Faraaz
>            Assignee: Steven Phillips
>             Fix For: 1.0.0
>
>
> Quoted strings appear in query output in different forms, as shown in the section below.
> Quotes should NOT appear in query output. Strings must be stripped of their leading and prevailing quotes. (I am referring to this character - " )
> Snippet of data from airports.cv file, first three lines, the first line has header information.
> {code}
> [root@centos-01 airport_CSV_data]# head -3 airports.csv
> "id","ident","type","name","latitude_deg","longitude_deg","elevation_ft","continent","iso_country","iso_region","municipality","scheduled_service","gps_code","iata_code","local_code","home_link","wikipedia_link","keywords"
> 6523,"00A","heliport","Total Rf Heliport",40.07080078125,-74.9336013793945,11,"NA","US","US-PA","Bensalem","no","00A",,"00A",,,
> 6524,"00AK","small_airport","Lowell Field",59.94919968,-151.695999146,450,"NA","US","US-AK","Anchor Point","no","00AK",,"00AK",,,
> {code}
> case 1) In this case quotes are not escaped, they appear in the output as is.
> {code}
> 0: jdbc:drill:> select columns[0] id,columns[1] ident,columns[2] type,columns[3] name,columns[4] latitude_deg,columns[5] longitude_deg,columns[6] elevation_ft,columns[7] continent,columns[8] iso_country,columns[9] iso_region,columns[10] municipality,columns[11] scheduled_service,columns[12] gps_code,columns[13] iata_code, columns[14] local_code,columns[15] home_link,columns[16] wikipedia_link,columns[17] keywords from `airports.csv` limit 3;
> +------------+------------+------------+------------+--------------+---------------+--------------+------------+-------------+------------+--------------+-------------------+------------+------------+------------+------------+----------------+------------+
> |     id     |   ident    |    type    |    name    | latitude_deg | longitude_deg | elevation_ft | continent  | iso_country | iso_region | municipality | scheduled_service |  gps_code  | iata_code  | local_code | home_link  | wikipedia_link |  keywords  |
> +------------+------------+------------+------------+--------------+---------------+--------------+------------+-------------+------------+--------------+-------------------+------------+------------+------------+------------+----------------+------------+
> | "id"       | "ident"    | "type"     | "name"     | "latitude_deg" | "longitude_deg" | "elevation_ft" | "continent" | "iso_country" | "iso_region" | "municipality" | "scheduled_service" | "gps_code" | "iata_code" | "local_code" | "home_link" | "wikipedia_link" | "keywords" |
> | 6523       | "00A"      | "heliport" | "Total Rf Heliport" | 40.07080078125 | -74.9336013793945 | 11           | "NA"       | "US"        | "US-PA"    | "Bensalem"   | "no"              | "00A"      |            | "00A"      |            |                | null       |
> | 6524       | "00AK"     | "small_airport" | "Lowell Field" | 59.94919968  | -151.695999146 | 450          | "NA"       | "US"        | "US-AK"    | "Anchor Point" | "no"              | "00AK"     |            | "00AK"     |            |                | null       |
> +------------+------------+------------+------------+--------------+---------------+--------------+------------+-------------+------------+--------------+-------------------+------------+------------+------------+------------+----------------+------------+
> 3 rows selected (0.155 seconds)
> {code}
> In this case quotes appear in the query output but they are escaped with backslash character in the output.
> {code}
> 0: jdbc:drill:> select * from `airports.csv` limit 3;
> +------------+
> |  columns   |
> +------------+
> | ["\"id\"","\"ident\"","\"type\"","\"name\"","\"latitude_deg\"","\"longitude_deg\"","\"elevation_ft\"","\"continent\"","\"iso_country\"","\"iso_region\"","\"municipality\"","\"scheduled_service\"","\"gps_code\"","\"iata_code\"","\"local_code\"","\"home_link\"","\"wikipedia_link\"","\"keywords\""] |
> | ["6523","\"00A\"","\"heliport\"","\"Total Rf Heliport\"","40.07080078125","-74.9336013793945","11","\"NA\"","\"US\"","\"US-PA\"","\"Bensalem\"","\"no\"","\"00A\"","","\"00A\"","",""] |
> | ["6524","\"00AK\"","\"small_airport\"","\"Lowell Field\"","59.94919968","-151.695999146","450","\"NA\"","\"US\"","\"US-AK\"","\"Anchor Point\"","\"no\"","\"00AK\"","","\"00AK\"","",""] |
> +------------+
> 3 rows selected (0.097 seconds)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)