You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Kunal Khatua (JIRA)" <ji...@apache.org> on 2017/03/21 00:47:41 UTC
[jira] [Updated] (DRILL-4982) Hive Queries degrade when queries
switch between different formats
[ https://issues.apache.org/jira/browse/DRILL-4982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kunal Khatua updated DRILL-4982:
--------------------------------
Reviewer: Rahul Challapalli
[~rkins] I think this has been verified, so close it if the commit is now in Master.
> Hive Queries degrade when queries switch between different formats
> ------------------------------------------------------------------
>
> Key: DRILL-4982
> URL: https://issues.apache.org/jira/browse/DRILL-4982
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Chunhui Shi
> Assignee: Karthikeyan Manivannan
> Priority: Critical
> Fix For: 1.10.0
>
>
> We have seen degraded performance by doing these steps:
> 1) generate the repro data:
> python script repro.py as below:
> import string
> import random
>
> for i in range(30000000):
> x1 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
> x2 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
> x3 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
> x4 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
> x5 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
> x6 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
> print "{0}".format(x1),"{0}".format(x2),"{0}".format(x3),"{0}".format(x4),"{0}".format(x5),"{0}".format(x6)
> python repro.py > repro.csv
> 2) put these files in a dfs directory e.g. '/tmp/hiveworkspace/plain'. Under hive prompt, use the following sql command to create an external table:
> CREATE EXTERNAL TABLE `hiveworkspace`.`plain` (`id1` string, `id2` string, `id3` string, `id4` string, `id5` string, `id6` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION '/tmp/hiveworkspace/plain'
> 3) create Hive's table of ORC|PARQUET format:
> CREATE TABLE `hiveworkspace`.`plainorc` STORED AS ORC AS SELECT id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;
> CREATE TABLE `hiveworkspace`.`plainparquet` STORED AS PARQUET AS SELECT id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;
> 4) Query switch between these two tables, then the query time on the same table significantly lengthened. On my setup, for ORC, it was 15sec -> 26secs. Queries on table of other formats, after injecting a query to other formats, all have significant slow down.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)