You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Tomasz Nykiel (JIRA)" <ji...@apache.org> on 2011/05/18 19:10:47 UTC
[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035493#comment-13035493 ] 

Tomasz Nykiel commented on HIVE-2144:
-------------------------------------

Currently the schema of the stat table is the following:

PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared.

We can amend it to:

PARTITION_STAT_TABLE ( ID VARCHAR(255) UNIQUE , ROW_COUNT BIGINT ).

Then instead of executing two queries per row inserted, we can execute one INSERT query, as we do currently.
In the case when the integrity constraint is violated, via the unique index, which can be caught by an exception, we perform a single UPDATE query.
The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table:

UPDATE PARTITION_STAT_TBL SET ROW_COUNT = new_value
    WHERE ID = "rowID" AND
(0)    new_value >
(1)                (SELECT TEMP.ROW_COUNT FROM
(2)                    (SELECT ROW_COUNT FROM PARTITION_STAT_TBL WHERE ID = "rowID") TEMP )

--(0) is a condition that checks if the newly inserted value is greater that the one we already have.
--(1) and (2) is a work-around for MySQL, which does not allow to refer to the table that occurs in the update statement. Here, we basically materialize the value that we need for comparison.
--(1) should theoretically have (LIMIT 1) to choose exactly one tuple, however Derby does not support it, and by the unique constraint, and the fact that the insert failed, there exists exactly one tuple matching the ID predicate.

To summarize, for non existing rows, only one insert query will be executed, instead of two.
For existing rows, which seems to occur very infrequently, two queries instead of three will be executed.


> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira