Posted to issues@drill.apache.org by "Steven Phillips (JIRA)" <ji...@apache.org> on 2015/04/16 22:53:59 UTC

[jira] [Commented] (DRILL-2806) Querying data from compressed csv file returns nulls and unreadable data

    [ https://issues.apache.org/jira/browse/DRILL-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498709#comment-14498709 ] 

Steven Phillips commented on DRILL-2806:
----------------------------------------

The problem is that the file has the wrong extension. To read gzip-compressed text files, the files need to have the .gz extension. As is, Drill does not recognize the .tgz extension, so it does not know that this is a compressed file. It falls back to the default format for the workspace and attempts to read the compressed bytes as such, which returns garbage data.
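
To make the extension-based selection described above concrete, here is a minimal sketch in Python. It is a simplified model for illustration only, not Drill's actual code: the lookup tables, the function name resolve_reader, and the "csv" workspace default are all assumptions.

```python
import os

# Hypothetical lookup tables for illustration; not Drill's real configuration.
COMPRESSION_EXTENSIONS = {".gz": "gzip"}
FORMAT_EXTENSIONS = {".csv": "csv", ".tsv": "tsv", ".json": "json"}


def resolve_reader(filename, default_format="csv"):
    """Return (codec, file_format) chosen purely from the file name.

    default_format stands in for the workspace default (assumed here).
    """
    stem, ext = os.path.splitext(filename)
    codec = COMPRESSION_EXTENSIONS.get(ext.lower())
    if codec is not None:
        # Recognized compression suffix: strip it and inspect the inner extension.
        _, ext = os.path.splitext(stem)
    # Unknown extensions fall back to the default format, with no decompression.
    file_format = FORMAT_EXTENSIONS.get(ext.lower(), default_format)
    return codec, file_format


if __name__ == "__main__":
    for name in ("deletions-00000-of-00020.tgz",
                 "deletions-00000-of-00020.csv.gz",
                 "deletions-00000-of-00020.csv"):
        print(name, resolve_reader(name))
```

In this model the .tgz name yields no codec, so the compressed bytes would be parsed directly as text, which matches the garbage output in the report. A name ending in .csv.gz yields a gzip codec and the csv format.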

> Querying data from compressed csv file returns nulls and unreadable data
> ------------------------------------------------------------------------
>
>                 Key: DRILL-2806
>                 URL: https://issues.apache.org/jira/browse/DRILL-2806
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 0.9.0
>         Environment: 9d92b8e319f2d46e8659d903d355450e15946533 | DRILL-2580: Exit early from HashJoinBatch if build side is empty | 26.03.2015
>            Reporter: Khurram Faraaz
>            Assignee: Steven Phillips
>
> Projecting columns from a compressed CSV data file returns unreadable data and nulls in the query results. Querying the same CSV file in uncompressed format returns correct results: readable data and no nulls. The test was performed on a 4-node cluster on CentOS.
> {code}
> 0: jdbc:drill:> select columns[0], columns[1], columns[2], columns[3], columns[4], columns[5], columns[6], columns[7] from `deletions-00000-of-00020.tgz` limit 10;
> +------------+------------+------------+------------+------------+------------+------------+------------+
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |   EXPR$5   |   EXPR$6   |   EXPR$7   |
> +------------+------------+------------+------------+------------+------------+------------+------------+
> | 0U[ˮȑ|axaR)ﺫ=鲍i̊HDJ|?3̑$%Q$%
>                                                 TdfD8'2i$E^/Y}C'>|/7
>                                                                                   H1o0! | 0g TMUܸW`ʙ&T
>                                                                                                                                 \uXپN|2I~Y 0RAX6UaXe+ow*]s | null       | null       | null       | null       | null       | null       |
> | oM.ڻU/ | ̼\
>                            )qwda7((
>                                                	y[) | 9>^0>WM[{r]iE$ze&!EküIfa | null       | null       | null       | null       | null       |
> | SRΠ     | null       | null       | null       | null       | null       | null       | null       |
> | 6imJ\f_dYڿ]%ln3IaE*BGA-a$j:M!Uc)ﶘD~wUx0ɼgme]ӘcQ*pk$%\2ER-)(ÈxTn?SϓxeҜݠºI|'(Cni	s | null       | null       | null       | null       | null       | null       | null       |
> | bxΜkr4ü_nIxl_s`vN	ó.$OL7Eބyڗia;Pu$M!AoCӦnlS-`ۢ+o~>%wzcgwtMge7"lMgZ=WྃgMRX1"a | X=Rd.fab{t{
>                                                                                                                                                                                        A!t
>                                                                                                                                                                                                1$ڧw-0EXURg
>                                                                                                                                                                                                                       p	#qzߤ΢gWMem{=z{
>                                                                                                                                                                                                                                                     eiA]^ | null       | null       | null       | null       | null       | null       |
> | ֌        | null       | null       | null       | null       | null       | null       | null       |
> | !{1H*m71`˰]oZ | 𾳔] &f4Z)4SP7Rm4^5WWXȧ<p.́3L
>                                                                                       q%|WL-p[ | null       | null       | null       | null       | null       | null       |
> | dqyd\K#"ԁ@ | null       | null       | null       | null       | null       | null       | null       |
> | [GԊKFlɢ(ZK8h#D/[(U=_8ΏE%
>                                                            [;
>                                                               w}Fr`#Xk
>                                                                               lT'15:y
>                                                                                                ņPz(-ȓ񆹞Cs)1v	 | null       | null       | null       | null       | null       | null       | null       |
> | LyPO|Ώ(+n+H]
>                          Ņ2?糩s/_ l
>                                             +ӯb	 | null       | null       | null       | null       | null       | null       | null       |
> +------------+------------+------------+------------+------------+------------+------------+------------+
> 10 rows selected (0.176 seconds)
> 0: jdbc:drill:> select columns[0], columns[1], columns[2], columns[3], columns[4], columns[5], columns[6], columns[7] from `deletions/deletions-00000-of-00020.csv` limit 10;
> +------------+------------+------------+------------+------------+------------+------------+------------+
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |   EXPR$5   |   EXPR$6   |   EXPR$7   |
> +------------+------------+------------+------------+------------+------------+------------+------------+
> | 1354980518007 | /user/mwcl_musicbrainz | 1356247116000 | /user/google_gardener | /m/0nj707g | /music/track_contribution/contributor | /m/09xmq3  | en         |
> | 1359609261000 | /user/ahsan2002us | 1359697206000 | /user/mjsigua | /m/0q47ym9 | /common/topic/description | Afrosheen CEO is the fictional character from the 2003 film The Watermelon Heist. | en         |
> | 1258294630005 | /user/book_bot | 1260214155000 | /user/book_bot | /m/08g19rh | /book/book_edition/book | /m/04sty07 | en         |
> | 1260232964000 | /user/book_bot | 1360880749000 | /user/turtlewax_bot | /m/0872_f2 | /book/book_edition/book | /m/069_gyc | en         |
> | 1320298552000 | /user/gardening_bot | 1358083965004 | /user/googlebot | /m/01dy3t2 | /type/object/type | /music/single | en         |
> | 1360430129006 | /user/mwcl_musicbrainz | 1362830875001 | /user/mwcl_musicbrainz | /m/0qm1x62 | /music/release_track/release | /m/0ql38vr | en         |
> | 1269251105000 | /user/mwcl_images | 1336539194001 | /user/gardening_bot | /m/06w7yw7 | /common/topic/image | /m/0bcncxt | en         |
> | 1225386250001 | /user/mwcl_images | 1336080683003 | /user/gardening_bot | /m/04sb526 | /common/licensed_object/license | /m/02x6b   | en         |
> | 1286991487000 | /user/mw_template_bot | 1362532733000 | /user/wikipedia_facts | /m/0dgs170 | /people/person/date_of_birth | 1975       | en         |
> | 1258986090000 | /user/book_bot | 1260138587000 | /user/book_bot | /m/08r_m33 | /book/book_edition/book | /m/04sty07 | en         |
> +------------+------------+------------+------------+------------+------------+------------+------------+
> 10 rows selected (0.25 seconds)
> Details of the files (compressed and uncompressed)
> [root@centos-01 ~]# hadoop fs -ls /tmp/deletions-00000-of-00020.tgz
> -rwxr-xr-x   3 root root  111364147 2015-04-16 20:35 /tmp/deletions-00000-of-00020.tgz
> [root@centos-01 ~]# hadoop fs -ls /tmp/deletions/deletions-00000-of-00020.csv
> -rwxr-xr-x   3 root root  395624293 2015-04-14 18:10 /tmp/deletions/deletions-00000-of-00020.csv
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)