Posted to issues@drill.apache.org by "Chun Chang (JIRA)" <ji...@apache.org> on 2015/10/01 20:27:26 UTC
[jira] [Closed] (DRILL-3209) [Umbrella] Plan reads of Hive tables as native Drill reads when a native reader for the underlying table format exists

     [ https://issues.apache.org/jira/browse/DRILL-3209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chun Chang closed DRILL-3209.
-----------------------------
    Assignee: Chun Chang  (was: Jason Altekruse)

The following build supports parquet native scan for Hive. 

{noformat}
+-------------------------------------------+---------------------------------------------------+----------------------------+--------------+----------------------------+
|                 commit_id                 |                  commit_message                   |        commit_time         | build_email  |         build_time         |
+-------------------------------------------+---------------------------------------------------+----------------------------+--------------+----------------------------+
| 83ebc7886f1a78e8ccca1a50725a000d3ca928c9  | DRILL-3479: Fix sqlline version for all profiles  | 30.09.2015 @ 20:07:11 UTC  | Unknown      | 30.09.2015 @ 21:30:37 UTC  |
+-------------------------------------------+---------------------------------------------------+----------------------------+--------------+----------------------------+
{noformat}

Using TPCH-100 Parquet data on an 11-node cluster (10.10.103.60-70), I verified that with native scan turned on, Drill can handle a TPC-H query that used to run out of memory (OOM). I also noticed that performance may suffer with native scan turned on: for some queries it can be 5-6 times slower.

{noformat}
tpch query   parquet   hive   native scan
19.q         40s       39s    40s
             39s       39s    39s
13.q         35s       50s    345s
01.q         31s       61s    164s
04.q         36s       53s    210s
05.q         42s       oom    110s
06.q         18s       40s    53s
{noformat}
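
To make the slowdown easier to read off the table, here is a small Python sketch that computes the native scan time relative to the other two columns. The timings are transcribed from the table above; the names `timings`, `slowdown`, and `ratios` are made up for this illustration, and only the first of the two 19.q runs is used.

```python
# Timings in seconds, transcribed from the table above as
# (parquet, hive, native scan). None marks the Hive run that hit OOM.
timings = {
    "19.q": (40, 39, 40),   # first of the two 19.q runs
    "13.q": (35, 50, 345),
    "01.q": (31, 61, 164),
    "04.q": (36, 53, 210),
    "05.q": (42, None, 110),
    "06.q": (18, 40, 53),
}


def slowdown(native, baseline):
    """Return native/baseline rounded to one decimal, or None if no baseline."""
    if baseline is None:
        return None
    return round(native / baseline, 1)


# For each query: (slowdown vs. parquet column, slowdown vs. hive column).
ratios = {
    query: (slowdown(native, parquet), slowdown(native, hive))
    for query, (parquet, hive, native) in timings.items()
}

for query, (vs_parquet, vs_hive) in ratios.items():
    print(query, "vs parquet:", vs_parquet, "vs hive:", vs_hive)
```

For 13.q, for example, this gives roughly 9.9x against the parquet column and 6.9x against the hive column, while 19.q is essentially unchanged.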

> [Umbrella] Plan reads of Hive tables as native Drill reads when a native reader for the underlying table format exists
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-3209
>                 URL: https://issues.apache.org/jira/browse/DRILL-3209
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization, Storage - Hive
>            Reporter: Jason Altekruse
>            Assignee: Chun Chang
>             Fix For: 1.2.0
>
>
> All reads against Hive are currently done through the Hive SerDe interface. While this provides the most flexibility, the API is not optimized for maximum performance when reading data into Drill's native data structures. For tables backed by Parquet and text files, we can plan these reads as Drill native reads. Currently, native reads of these file types provide untyped data. While Parquet has metadata in the file, we currently do not make use of the type information while planning. For text files, we read every file as lists of varchars. In both cases, casts will need to be injected to provide the same data types as reads through the SerDe interface.
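
As a rough illustration of the cast injection described in the issue, the Python sketch below builds a projection that casts positional varchar columns from a text file to the types declared in a Hive table. It is not how Drill implements this (the issue talks about injecting casts during planning, not rewriting SQL text). The `HIVE_TO_DRILL` mapping, the `cast_projection` helper, and the example column names are assumptions made up for this sketch; `columns[n]` is the array Drill's text reader exposes for delimited files.

```python
# Hypothetical, deliberately incomplete mapping from Hive column types
# to SQL cast targets; a real mapping would cover many more types.
HIVE_TO_DRILL = {
    "int": "INT",
    "bigint": "BIGINT",
    "double": "DOUBLE",
    "string": "VARCHAR",
}


def cast_projection(hive_columns):
    """Build a SELECT list casting positional text columns to named, typed ones.

    hive_columns is a list of (column_name, hive_type) pairs in table order.
    """
    parts = []
    for index, (name, hive_type) in enumerate(hive_columns):
        drill_type = HIVE_TO_DRILL[hive_type]
        parts.append(f"CAST(columns[{index}] AS {drill_type}) AS {name}")
    return ", ".join(parts)


# Example schema (assumed for illustration only).
projection = cast_projection([("o_orderkey", "bigint"), ("o_comment", "string")])
print("SELECT " + projection + " FROM ...")
```

The point of the sketch is only that the Hive metastore schema supplies the names and types that the untyped native read lacks.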


