Posted to issues@hive.apache.org by "Xuefu Zhang (JIRA)" <ji...@apache.org> on 2017/02/08 01:23:41 UTC

[jira] [Comment Edited] (HIVE-15682) Eliminate per-row based dummy iterator creation

    [ https://issues.apache.org/jira/browse/HIVE-15682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857172#comment-15857172 ] 

Xuefu Zhang edited comment on HIVE-15682 at 2/8/17 1:23 AM:
------------------------------------------------------------

Hi [~Ferd], when I ran the query, I had two days' data, which is about 25m rows. I just ran the query again with about 10 days' data (about 130m rows), and the runtime was about 600s. I have 32 executors, each with 4 cores. The query spends most of its time on the second stage, where sorting via a single reducer occurs.

I don't think the scale matters much as long as the query runs for some time (at least a few minutes). Thus, you should be able to use TPC-DS (or its alternatives) data for this exercise.


was (Author: xuefuz):
Hi [~Ferd], when I ran the query, I had two days' data, which is about 25m rows. I just ran the query again with about 10 days' data (about 130m rows), and the runtime was about 600s. I have 32 executors, each with 4 cores. The query spends most of its time on the second stage, where sorting via a single reducer occurs.

I don't think the scale matters much as long as the query runs for some time (at least a few minutes). Thus, you should be able to use TPC-DS data for this exercise.

> Eliminate per-row based dummy iterator creation
> -----------------------------------------------
>
>                 Key: HIVE-15682
>                 URL: https://issues.apache.org/jira/browse/HIVE-15682
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>    Affects Versions: 2.2.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>             Fix For: 2.2.0
>
>         Attachments: HIVE-15682.patch
>
>
> HIVE-15580 introduced a dummy iterator per input row, which can be eliminated because {{SparkReduceRecordHandler}} is able to handle single key-value pairs. We can refactor this part of the code 1. to remove the need for an iterator and 2. to optimize the code path for per-(key, value) processing (instead of (key, value iterator) processing). It would also be great if we could measure the performance after the optimizations and compare it to the performance prior to HIVE-15580.
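
The pattern described in the issue can be sketched in plain Java. This is a minimal standalone model, not Hive's actual code: the names {{ReduceHandler}}, {{processKeyValues}}, and {{processRow}} are illustrative stand-ins for the real {{SparkReduceRecordHandler}} API, shown only to contrast the per-row dummy-iterator path with a direct single-pair path.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Illustrative model only; class and method names are hypothetical,
// not Hive's actual API.
public class ReduceHandler {
    final List<String> output = new ArrayList<>();

    // Iterator-based path: a per-row caller must allocate a throwaway
    // singleton iterator for each value just to satisfy this signature.
    void processKeyValues(String key, Iterator<String> values) {
        while (values.hasNext()) {
            output.add(key + "=" + values.next());
        }
    }

    // Direct single-pair path: same result, no per-row iterator object.
    void processRow(String key, String value) {
        output.add(key + "=" + value);
    }

    public static void main(String[] args) {
        ReduceHandler h = new ReduceHandler();
        // Before: one dummy iterator allocated per input row.
        h.processKeyValues("k1", Collections.singletonList("v1").iterator());
        // After: the refactored path processes the pair directly.
        h.processRow("k2", "v2");
        System.out.println(h.output);
    }
}
```

The point of the refactoring is that when the handler already supports a single (key, value) pair, the dummy-iterator wrapping done once per row is pure allocation overhead on a hot path.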



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)