You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by "Ning Zhang (JIRA)" <ji...@apache.org> on 2011/05/03 02:18:03 UTC

[jira] [Created] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

reduce workload generated by JDBCStatsPublisher
-----------------------------------------------

                 Key: HIVE-2144
                 URL: https://issues.apache.org/jira/browse/HIVE-2144
             Project: Hive
          Issue Type: Improvement
            Reporter: Ning Zhang
            Assignee: Ning Zhang


In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038348#comment-13038348 ] 

Ning Zhang commented on HIVE-2144:
----------------------------------

+1. Will commit if tests pass. 

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2144.1.patch, HIVE-2144.2.patch, HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038932#comment-13038932 ] 

Hudson commented on HIVE-2144:
------------------------------

Integrated in Hive-trunk-h0.21 #748 (See [https://builds.apache.org/hudson/job/Hive-trunk-h0.21/748/])
    HIVE-2144. reduce workload generated by JDBCStatsPublisher (Tomasz Nykiel via Ning Zhang)

nzhang : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1127229
Files : 
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java


> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>             Fix For: 0.8.0
>
>         Attachments: HIVE-2144.1.patch, HIVE-2144.2.patch, HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038419#comment-13038419 ] 

Ning Zhang commented on HIVE-2144:
----------------------------------

Tom, TestMetaStoreAuthorization failed with 1 Error. Can you take a look?

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2144.1.patch, HIVE-2144.2.patch, HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "Tomasz Nykiel (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038757#comment-13038757 ] 

Tomasz Nykiel commented on HIVE-2144:
-------------------------------------

Thanks :)

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>             Fix For: 0.8.0
>
>         Attachments: HIVE-2144.1.patch, HIVE-2144.2.patch, HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang resolved HIVE-2144.
------------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0
     Hadoop Flags: [Reviewed]

Committed. Thanks Tom!

Congrats on your 1st commit!

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>             Fix For: 0.8.0
>
>         Attachments: HIVE-2144.1.patch, HIVE-2144.2.patch, HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "Tomasz Nykiel (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039417#comment-13039417 ] 

Tomasz Nykiel commented on HIVE-2144:
-------------------------------------

IMPORTANT NOTE!

Before deployment, the primary key constraint needs to be added manually on the ID column of PARTITION_STAT_TBL, if the table already exists.
Otherwise, the statistics might be duplicated for some entries, and the aggregated statistics will be silently incorrect.

If the table does not exist, it will be created in the proper format.



> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>             Fix For: 0.8.0
>
>         Attachments: HIVE-2144.1.patch, HIVE-2144.2.patch, HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "Tomasz Nykiel (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035493#comment-13035493 ] 

Tomasz Nykiel commented on HIVE-2144:
-------------------------------------

Currently the schema of the stat table is the following:

PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared.

We can amend it to:

PARTITION_STAT_TABLE ( ID VARCHAR(255) UNIQUE , ROW_COUNT BIGINT ).

Then instead of executing two queries per row inserted, we can execute one INSERT query, as we do currently.
In the case when the integrity constraint is violated, via the unique index, which can be caught by an exception, we perform a single UPDATE query.
The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table:

UPDATE PARTITION_STAT_TBL SET ROW_COUNT = new_value
    WHERE ID = "rowID" AND
(0)    new_value >
(1)                (SELECT TEMP.ROW_COUNT FROM
(2)                    (SELECT ROW_COUNT FROM PARTITION_STAT_TBL WHERE ID = "rowID") TEMP )

--(0) is a condition that checks if the newly inserted value is greater that the one we already have.
--(1) and (2) is a work-around for MySQL, which does not allow to refer to the table that occurs in the update statement. Here, we basically materialize the value that we need for comparison.
--(1) should theoretically have (LIMIT 1) to choose exactly one tuple, however Derby does not support it, and by the unique constraint, and the fact that the insert failed, there exists exactly one tuple matching the ID predicate.

To summarize, for non existing rows, only one insert query will be executed, instead of two.
For existing rows, which seems to occur very infrequently, two queries instead of three will be executed.


> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036707#comment-13036707 ] 

jiraposter@reviews.apache.org commented on HIVE-2144:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/#review696
-----------------------------------------------------------

trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1395>

    Here we need to catch SQLRecoverableException and retry.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1396>

    If these parameters present in conf/hive-default.xml, you don't need to set them again here since new JobConf() should read from hive-default.xml.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1397>

    the usual use case for aggregateStats() is that the key should be the prefix (e.g., file_000) of the string inserted by publishStats, so that all keys that match the prefix will be aggregated.

    Can you add one more test for aggregateStats('file_000')?

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1398>

    won't this also change the stats at the 2nd publishStat()?

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1399>

    also add another aggStats for prefix.

- Ning

On 2011-05-19 23:14:26, Tomasz Nykiel wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/765/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-05-19 23:14:26)
bq.  
bq.  
bq.  Review request for hive.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics, first query to check if the ID was inserted by another task, and second query to insert a new or update the existing row.
bq.  The latter occurs very rarely, since duplicates most likely originate from speculative failed tasks.
bq.  
bq.  Currently the schema of the stat table is the following:
bq.  
bq.  PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared.
bq.  
bq.  We amend it to:
bq.  
bq.  PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).
bq.  
bq.  HIVE-2144 improves on performance by greedily performing the insertion statement.
bq.  Then instead of executing two queries per row inserted, we can execute one INSERT query.
bq.  In the case primary key constraint violation, we perform a single UPDATE query.
bq.  The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table.
bq.  
bq.  
bq.  This addresses bug HIVE-2144.
bq.      https://issues.apache.org/jira/browse/HIVE-2144
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125140 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/765/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  TestStatsPublisher JUnit test:
bq.  - basic behaviour
bq.  - multiple updates
bq.  - cleanup of the statistics table after aggregation
bq.  
bq.  Standalone testing on the cluster.
bq.  - insert/analyze queries over non-partitioned/partitioned tables
bq.  
bq.  NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Tomasz
bq.  
bq.

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036932#comment-13036932 ] 

jiraposter@reviews.apache.org commented on HIVE-2144:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/#review702
-----------------------------------------------------------

trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1402>

    Yes. That's correct.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1403>

    ok.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1404>

    I will amend the test cases to aggregate over prefixes. I will also add one simple test case to aggregate over exact match.

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1405>

    The original value inserted in line 120 is 200. Neither 100, nor 150 should change the values. 

trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1406>

    As disscussed before, I will improve the test cases to aggregate over prefixes.

- Tomasz

On 2011-05-19 23:14:26, Tomasz Nykiel wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/765/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-05-19 23:14:26)
bq.  
bq.  
bq.  Review request for hive.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics, first query to check if the ID was inserted by another task, and second query to insert a new or update the existing row.
bq.  The latter occurs very rarely, since duplicates most likely originate from speculative failed tasks.
bq.  
bq.  Currently the schema of the stat table is the following:
bq.  
bq.  PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared.
bq.  
bq.  We amend it to:
bq.  
bq.  PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).
bq.  
bq.  HIVE-2144 improves on performance by greedily performing the insertion statement.
bq.  Then instead of executing two queries per row inserted, we can execute one INSERT query.
bq.  In the case primary key constraint violation, we perform a single UPDATE query.
bq.  The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table.
bq.  
bq.  
bq.  This addresses bug HIVE-2144.
bq.      https://issues.apache.org/jira/browse/HIVE-2144
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125140 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/765/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  TestStatsPublisher JUnit test:
bq.  - basic behaviour
bq.  - multiple updates
bq.  - cleanup of the statistics table after aggregation
bq.  
bq.  Standalone testing on the cluster.
bq.  - insert/analyze queries over non-partitioned/partitioned tables
bq.  
bq.  NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Tomasz
bq.  
bq.

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036566#comment-13036566 ] 

jiraposter@reviews.apache.org commented on HIVE-2144:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/
-----------------------------------------------------------

Review request for hive.


Summary
-------

Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics, first query to check if the ID was inserted by another task, and second query to insert a new or update the existing row.
The latter occurs very rarely, since duplicates most likely originate from speculative failed tasks.

Currently the schema of the stat table is the following:

PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared.

We amend it to:

PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).

HIVE-2144 improves on performance by greedily performing the insertion statement.
Then instead of executing two queries per row inserted, we can execute one INSERT query.
In the case primary key constraint violation, we perform a single UPDATE query.
The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table.


This addresses bug HIVE-2144.
    https://issues.apache.org/jira/browse/HIVE-2144


Diffs
-----

  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125140 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION 

Diff: https://reviews.apache.org/r/765/diff


Testing
-------

TestStatsPublisher JUnit test:
- basic behaviour
- multiple updates
- cleanup of the statistics table after aggregation

Standalone testing on the cluster.
- insert/analyze queries over non-partitioned/partitioned tables

NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared.


Thanks,

Tomasz



> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "Tomasz Nykiel (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomasz Nykiel updated HIVE-2144:
--------------------------------

    Attachment: HIVE-2144.2.patch

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2144.1.patch, HIVE-2144.2.patch, HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038236#comment-13038236 ] 

jiraposter@reviews.apache.org commented on HIVE-2144:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/#review709
-----------------------------------------------------------

trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1417>

    can you add a comment on what situation this exception will be thrown? Just for the sake of reader that didn't notice there is a primary key constraint in the DDL. 

trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1418>

    remove this?

- Ning

On 2011-05-21 01:49:07, Tomasz Nykiel wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/765/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-05-21 01:49:07)
bq.  
bq.  
bq.  Review request for hive.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics, first query to check if the ID was inserted by another task, and second query to insert a new or update the existing row.
bq.  The latter occurs very rarely, since duplicates most likely originate from speculative failed tasks.
bq.  
bq.  Currently the schema of the stat table is the following:
bq.  
bq.  PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared.
bq.  
bq.  We amend it to:
bq.  
bq.  PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).
bq.  
bq.  HIVE-2144 improves on performance by greedily performing the insertion statement.
bq.  Then instead of executing two queries per row inserted, we can execute one INSERT query.
bq.  In the case primary key constraint violation, we perform a single UPDATE query.
bq.  The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table.
bq.  
bq.  
bq.  This addresses bug HIVE-2144.
bq.      https://issues.apache.org/jira/browse/HIVE-2144
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125468 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/765/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  TestStatsPublisher JUnit test:
bq.  - basic behaviour
bq.  - multiple updates
bq.  - cleanup of the statistics table after aggregation
bq.  
bq.  Standalone testing on the cluster.
bq.  - insert/analyze queries over non-partitioned/partitioned tables
bq.  
bq.  NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Tomasz
bq.  
bq.

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2144.1.patch, HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038735#comment-13038735 ] 

Ning Zhang commented on HIVE-2144:
----------------------------------

Actually the failure was due to port confliction on my test machine. Tested on another box and it passed. 

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2144.1.patch, HIVE-2144.2.patch, HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037221#comment-13037221 ] 

jiraposter@reviews.apache.org commented on HIVE-2144:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/
-----------------------------------------------------------

(Updated 2011-05-21 01:49:07.819494)


Review request for hive.


Changes
-------

-Amended the test cases to accommodate prefix aggregation
-Fixed unnecessary conf settings
-Fixed exception handling in JDBCStatsPublisher.publishStats -> SQLRecoverableException is handled when executing the update statement.


Summary
-------

Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics, first query to check if the ID was inserted by another task, and second query to insert a new or update the existing row.
The latter occurs very rarely, since duplicates most likely originate from speculative failed tasks.

Currently the schema of the stat table is the following:

PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have any integrity constraints declared.

We amend it to:

PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).

HIVE-2144 improves on performance by greedily performing the insertion statement.
Then instead of executing two queries per row inserted, we can execute one INSERT query.
In the case primary key constraint violation, we perform a single UPDATE query.
The UPDATE query needs to check the condition, if the currently inserted stats are "newer" then the ones already in the table.


This addresses bug HIVE-2144.
    https://issues.apache.org/jira/browse/HIVE-2144


Diffs (updated)
-----

  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125468 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION 

Diff: https://reviews.apache.org/r/765/diff


Testing
-------

TestStatsPublisher JUnit test:
- basic behaviour
- multiple updates
- cleanup of the statistics table after aggregation

Standalone testing on the cluster.
- insert/analyze queries over non-partitioned/partitioned tables

NOTE. For the correct behaviour, the primary_key index needs to be created, or the PARTITION_STAT_TABLE table dropped - which triggers creation of the table with the constraint declared.


Thanks,

Tomasz



> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "Tomasz Nykiel (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomasz Nykiel updated HIVE-2144:
--------------------------------

    Attachment: HIVE-2144.patch

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang reassigned HIVE-2144:
--------------------------------

    Assignee: Tomasz Nykiel  (was: Ning Zhang)

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "Tomasz Nykiel (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035634#comment-13035634 ] 

Tomasz Nykiel commented on HIVE-2144:
-------------------------------------

Yes, I agree. There are some subtle differences between UNIQUE and PK in Derby and MySQL (e.g., in MySQL the unique index allows null values, and in Derby it does not. So in general, PK constraint will be more suitable.

CREATE TABLE PARTITION_STAT_TBL ( IDE VARCHAR(255) PRIMARY KEY, ROW_COUNT BIGINT ) works for both Derby and MySql.
After a quick check it seems that it's supported by Oracle/MSSQL as well.



> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "Tomasz Nykiel (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomasz Nykiel updated HIVE-2144:
--------------------------------

    Attachment: HIVE-2144.1.patch

Fixed after revision 1.

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2144.1.patch, HIVE-2144.patch
>
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035612#comment-13035612 ] 

Ning Zhang commented on HIVE-2144:
----------------------------------

Great! I like the idea. 

One comment about the primary key constraint: I'm not sure if UNIQUE is the standard way to specify primary key constraint. There are people using Oralce/MS SQL sever/Postgres as metastore, we should use a standard way. I think 'id varchar(255) PRIMARY KEY' is more widely supported. Can you double check with mysql and derby?

> reduce workload generated by JDBCStatsPublisher
> -----------------------------------------------
>
>                 Key: HIVE-2144
>                 URL: https://issues.apache.org/jira/browse/HIVE-2144
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Tomasz Nykiel
>
> In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (mostly likely a speculative or previously failed task). Depending on if the ID is there, an INSERT or UPDATE query was issues. So there are basically 2x of queries per row inserted into the intermediate stats table. This workload could be reduced to 1/2 if we insert it anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefits are that even though the aggregation query is more expensive, it is only run once per query. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira