You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Khurram Faraaz (JIRA)" <ji...@apache.org> on 2015/04/22 01:16:59 UTC
[jira] [Commented] (DRILL-2760) Quoted strings from CSV file appear
in query output in different forms
[ https://issues.apache.org/jira/browse/DRILL-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505995#comment-14505995 ]
Khurram Faraaz commented on DRILL-2760:
---------------------------------------
{code}
Single quotes used in the beginning and end of each string inside a CSV file are returned as part of the result, these must be stripped from the results by the query.
0: jdbc:drill:> select * from `singleQuote.csv`;
+------------+
| columns |
+------------+
| ["'2342342'","'this is a test string1'","'12112.297'","'this is cool!'","'@#$%^&*()'","'test@testmail.com'","' string with spaces '"] |
| ["'3042672'","'this is a test string2'","'82212.598'","'this is cool!'","'@#$%^&*()'","'test@testmail.com'","' string with spaces '"] |
| ["'9142082'","'this is a test string3'","'92612.891'","'this is cool!'","'@#$%^&*()'","'test@testmail.com'","' string with spaces '"] |
| ["'1942442'","'this is a test string4'","'32812.294'","'this is cool!'","'@#$%^&*()'","'test@testmail.com'","' string with spaces '"] |
+------------+
4 rows selected (0.26 seconds)
0: jdbc:drill:> select columns[0], columns[1], columns[2], columns[3], columns[4], columns[5], columns[6] from `singleQuote.csv`;
+------------+------------+------------+------------+------------+------------+------------+
| EXPR$0 | EXPR$1 | EXPR$2 | EXPR$3 | EXPR$4 | EXPR$5 | EXPR$6 |
+------------+------------+------------+------------+------------+------------+------------+
| '2342342' | 'this is a test string1' | '12112.297' | 'this is cool!' | '@#$%^&*()' | 'test@testmail.com' | ' string with spaces ' |
| '3042672' | 'this is a test string2' | '82212.598' | 'this is cool!' | '@#$%^&*()' | 'test@testmail.com' | ' string with spaces ' |
| '9142082' | 'this is a test string3' | '92612.891' | 'this is cool!' | '@#$%^&*()' | 'test@testmail.com' | ' string with spaces ' |
| '1942442' | 'this is a test string4' | '32812.294' | 'this is cool!' | '@#$%^&*()' | 'test@testmail.com' | ' string with spaces ' |
+------------+------------+------------+------------+------------+------------+------------+
4 rows selected (0.324 seconds)
data from CSV file
[root@centos-01 CSV_delim_quotes]# hadoop fs -cat /tmp/singleQuote.csv
'2342342','this is a test string1','12112.297','this is cool!','@#$%^&*()','test@testmail.com',' string with spaces '
'3042672','this is a test string2','82212.598','this is cool!','@#$%^&*()','test@testmail.com',' string with spaces '
'9142082','this is a test string3','92612.891','this is cool!','@#$%^&*()','test@testmail.com',' string with spaces '
'1942442','this is a test string4','32812.294','this is cool!','@#$%^&*()','test@testmail.com',' string with spaces '
{code}
> Quoted strings from CSV file appear in query output in different forms
> ----------------------------------------------------------------------
>
> Key: DRILL-2760
> URL: https://issues.apache.org/jira/browse/DRILL-2760
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Text & CSV
> Affects Versions: 0.9.0
> Environment: | 9d92b8e319f2d46e8659d903d355450e15946533 | DRILL-2580: Exit early from HashJoinBatch if build side is empty | 26.03.2015 @ 16:13:53 EDT
> 4 node cluster on CentOS
> Reporter: Khurram Faraaz
> Assignee: Steven Phillips
> Fix For: 1.0.0
>
>
> Quoted strings appear in query output in different forms, as shown in the section below.
> Quotes should NOT appear in query output. Strings must be stripped of their leading and prevailing quotes. (I am referring to this character - " )
> Snippet of data from airports.cv file, first three lines, the first line has header information.
> {code}
> [root@centos-01 airport_CSV_data]# head -3 airports.csv
> "id","ident","type","name","latitude_deg","longitude_deg","elevation_ft","continent","iso_country","iso_region","municipality","scheduled_service","gps_code","iata_code","local_code","home_link","wikipedia_link","keywords"
> 6523,"00A","heliport","Total Rf Heliport",40.07080078125,-74.9336013793945,11,"NA","US","US-PA","Bensalem","no","00A",,"00A",,,
> 6524,"00AK","small_airport","Lowell Field",59.94919968,-151.695999146,450,"NA","US","US-AK","Anchor Point","no","00AK",,"00AK",,,
> {code}
> case 1) In this case quotes are not escaped, they appear in the output as is.
> {code}
> 0: jdbc:drill:> select columns[0] id,columns[1] ident,columns[2] type,columns[3] name,columns[4] latitude_deg,columns[5] longitude_deg,columns[6] elevation_ft,columns[7] continent,columns[8] iso_country,columns[9] iso_region,columns[10] municipality,columns[11] scheduled_service,columns[12] gps_code,columns[13] iata_code, columns[14] local_code,columns[15] home_link,columns[16] wikipedia_link,columns[17] keywords from `airports.csv` limit 3;
> +------------+------------+------------+------------+--------------+---------------+--------------+------------+-------------+------------+--------------+-------------------+------------+------------+------------+------------+----------------+------------+
> | id | ident | type | name | latitude_deg | longitude_deg | elevation_ft | continent | iso_country | iso_region | municipality | scheduled_service | gps_code | iata_code | local_code | home_link | wikipedia_link | keywords |
> +------------+------------+------------+------------+--------------+---------------+--------------+------------+-------------+------------+--------------+-------------------+------------+------------+------------+------------+----------------+------------+
> | "id" | "ident" | "type" | "name" | "latitude_deg" | "longitude_deg" | "elevation_ft" | "continent" | "iso_country" | "iso_region" | "municipality" | "scheduled_service" | "gps_code" | "iata_code" | "local_code" | "home_link" | "wikipedia_link" | "keywords" |
> | 6523 | "00A" | "heliport" | "Total Rf Heliport" | 40.07080078125 | -74.9336013793945 | 11 | "NA" | "US" | "US-PA" | "Bensalem" | "no" | "00A" | | "00A" | | | null |
> | 6524 | "00AK" | "small_airport" | "Lowell Field" | 59.94919968 | -151.695999146 | 450 | "NA" | "US" | "US-AK" | "Anchor Point" | "no" | "00AK" | | "00AK" | | | null |
> +------------+------------+------------+------------+--------------+---------------+--------------+------------+-------------+------------+--------------+-------------------+------------+------------+------------+------------+----------------+------------+
> 3 rows selected (0.155 seconds)
> {code}
> In this case quotes appear in the query output but they are escaped with backslash character in the output.
> {code}
> 0: jdbc:drill:> select * from `airports.csv` limit 3;
> +------------+
> | columns |
> +------------+
> | ["\"id\"","\"ident\"","\"type\"","\"name\"","\"latitude_deg\"","\"longitude_deg\"","\"elevation_ft\"","\"continent\"","\"iso_country\"","\"iso_region\"","\"municipality\"","\"scheduled_service\"","\"gps_code\"","\"iata_code\"","\"local_code\"","\"home_link\"","\"wikipedia_link\"","\"keywords\""] |
> | ["6523","\"00A\"","\"heliport\"","\"Total Rf Heliport\"","40.07080078125","-74.9336013793945","11","\"NA\"","\"US\"","\"US-PA\"","\"Bensalem\"","\"no\"","\"00A\"","","\"00A\"","",""] |
> | ["6524","\"00AK\"","\"small_airport\"","\"Lowell Field\"","59.94919968","-151.695999146","450","\"NA\"","\"US\"","\"US-AK\"","\"Anchor Point\"","\"no\"","\"00AK\"","","\"00AK\"","",""] |
> +------------+
> 3 rows selected (0.097 seconds)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)