Posted to dev@hive.apache.org by "Ning Zhang (JIRA)" <ji...@apache.org> on 2010/05/22 01:16:17 UTC

[jira] Created: (HIVE-1361) table/partition level statistics

table/partition level statistics
--------------------------------

                 Key: HIVE-1361
                 URL: https://issues.apache.org/jira/browse/HIVE-1361
             Project: Hadoop Hive
          Issue Type: Sub-task
    Affects Versions: 0.6.0
            Reporter: Ning Zhang
            Assignee: Ahmed M Aly
             Fix For: 0.6.0


As a first step, we gather table-level stats for non-partitioned tables and partition-level stats for partitioned tables. Future work could extend table-level stats to partitioned tables as well.

There are 3 major milestones in this subtask: 
 1) extend the insert statement to gather table/partition level stats on-the-fly.
 2) extend the metastore API to support storing and retrieving stats for a particular table/partition.
 3) add an ANALYZE TABLE [PARTITION] statement in Hive QL to gather stats for existing tables/partitions. 

The proposed stats are:

Partition-level stats: 
  - number of rows
  - total size in bytes
  - number of files
  - max, min, average row sizes
  - max, min, average file sizes

Table-level stats in addition to partition-level stats:
  - number of partitions
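
As a rough illustration of the proposed counters, the sketch below models the three additive partition-level stats and derives the averages from them. The class, field, and method names are hypothetical and do not come from the patch. The max and min values are left out because they cannot be derived from the counters and would have to be tracked separately.

```java
// Hypothetical value class; names are illustrative, not taken from the HIVE-1361 patch.
final class PartitionStats {
    final long numRows;
    final long totalSizeBytes;
    final long numFiles;

    PartitionStats(long numRows, long totalSizeBytes, long numFiles) {
        this.numRows = numRows;
        this.totalSizeBytes = totalSizeBytes;
        this.numFiles = numFiles;
    }

    // Average row size in bytes; 0 for an empty partition to avoid dividing by zero.
    double averageRowSize() {
        return numRows == 0 ? 0.0 : (double) totalSizeBytes / numRows;
    }

    // Average file size in bytes; 0 when the partition has no files.
    double averageFileSize() {
        return numFiles == 0 ? 0.0 : (double) totalSizeBytes / numFiles;
    }

    // Adds the counters of two partitions, e.g. to roll partition stats up to a table.
    PartitionStats plus(PartitionStats other) {
        return new PartitionStats(
            numRows + other.numRows,
            totalSizeBytes + other.totalSizeBytes,
            numFiles + other.numFiles);
    }
}
```

Because all three counters are additive, a table-level figure for a partitioned table could in principle be obtained by summing its partitions with `plus`, with the number of partitions counted alongside.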


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-1361) table/partition level statistics

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1361:
-----------------------------

    Attachment:     (was: HIVE-1361.3.patch)

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1361.2.patch, HIVE-1361.2_java_only.patch, HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch

[jira] Updated: (HIVE-1361) table/partition level statistics

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1361:
-----------------------------

    Attachment: HIVE-1361.3.patch

Uploading HIVE-1361.3.patch, which passes all tests on Hadoop 0.20 and 0.17. The only difference from the last patch is the log change in stats2.q.out.

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1361.2.patch, HIVE-1361.2_java_only.patch, HIVE-1361.3.patch, HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch

[jira] Commented: (HIVE-1361) table/partition level statistics

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910350#action_12910350 ] 

John Sichi commented on HIVE-1361:
----------------------------------

Yay for Java-only patch :)

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>    Affects Versions: 0.6.0
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>         Attachments: HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch

[jira] Commented: (HIVE-1361) table/partition level statistics

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870264#action_12870264 ] 

Ning Zhang commented on HIVE-1361:
----------------------------------

All these stats should be able to be collected automatically at insert time. Since loading doesn't scan the data, we cannot gather stats from that command.


[jira] Commented: (HIVE-1361) table/partition level statistics

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910380#action_12910380 ] 

HBase Review Board commented on HIVE-1361:
------------------------------------------

Message from: "Carl Steinbach" <ca...@cloudera.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/862/
-----------------------------------------------------------

Review request for Hive Developers.


Summary
-------

HIVE-1361


This addresses bug HIVE-1361.
    http://issues.apache.org/jira/browse/HIVE-1361


Diffs
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 997199 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java PRE-CREATION 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java PRE-CREATION 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java PRE-CREATION 
  trunk/ql/src/gen-javabean/org/apache/hadoop/hive/ql/plan/api/StageType.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/JobCloseFeedBack.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MoveTask.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Task.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TaskFactory.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FileSinkDesc.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/StatsWork.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 997199 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsFactory.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java PRE-CREATION 

Diff: http://review.cloudera.org/r/862/diff


Testing
-------


Thanks,

Carl





[jira] Commented: (HIVE-1361) table/partition level statistics

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910813#action_12910813 ] 

HBase Review Board commented on HIVE-1361:
------------------------------------------

Message from: "John Sichi" <js...@facebook.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/862/#review1262
-----------------------------------------------------------



trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
<http://review.cloudera.org/r/862/#comment4256>

    Hive conf additions should be accompanied by new entries in conf/hive-default.xml for documentation purposes.



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java
<http://review.cloudera.org/r/862/#comment4257>

    Using e.toString() alone here may lose some of the diagnostics.
    
    LOG.error has an overload which takes a Throwable parameter; use that to make sure that all the diagnostics (e.g. nested throwables) are logged.
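
The difference between the two logging styles can be sketched as follows. To stay self-contained this uses `java.util.logging` rather than the logging API in the patch, and the class name and message text are made up for illustration. The point is the same: pass the `Throwable` itself instead of a string built from it.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative only; the class name and message text are assumptions.
final class LogWithCause {
    private static final Logger LOG = Logger.getLogger(LogWithCause.class.getName());

    // Builds the message the lossy way: the stack trace and any nested cause are dropped.
    static String lossy(Exception e) {
        return "Error during stats aggregation: " + e.toString();
    }

    // Hands the Throwable to the logger so the handler can print the stack trace
    // and the full chain of causes.
    static void logFully(Exception e) {
        LOG.log(Level.SEVERE, "Error during stats aggregation", e);
    }
}
```

With an exception such as `new RuntimeException("outer", new IOException("connection refused"))`, the string from `lossy` contains no trace of the nested `IOException`, while `logFully` reports it under "Caused by".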



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java
<http://review.cloudera.org/r/862/#comment4258>

    As a performance followup, we probably want to use delete(List<Delete>) for batching.
    



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java
<http://review.cloudera.org/r/862/#comment4264>

    See perf comment above.  Also, this scan+delete code could be shared to avoid duplication.
    



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java
<http://review.cloudera.org/r/862/#comment4266>

    See comments in HBaseStatsAggregator regarding diagnostics.
    



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java
<http://review.cloudera.org/r/862/#comment4267>

    Another perf note:  for batch update, we can use setAutoFlush(false) and then flushCommits in closeConnection.



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java
<http://review.cloudera.org/r/862/#comment4268>

    Probably need a followup to make this configurable.



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java
<http://review.cloudera.org/r/862/#comment4269>

    What is this code for?



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
<http://review.cloudera.org/r/862/#comment4294>

    Isn't this going to throw an NPE if aggregateStats returns null after handling an error?
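
A minimal sketch of the guard this comment asks for, assuming the aggregator hands back the value as a `String` and signals a handled error by returning null. The `Aggregator` interface, the stat name, and the `-1` sentinel are all hypothetical stand-ins, not the patch's actual signatures.

```java
final class NullSafeStats {
    // Hypothetical stand-in for an aggregator that returns null after handling an error.
    interface Aggregator {
        String aggregateStats(String key, String statType);
    }

    // Returns the parsed row count, or -1 when the aggregator had nothing to report.
    static long rowCount(Aggregator agg, String key) {
        String value = agg.aggregateStats(key, "row_count");
        if (value == null) {
            // Nothing was aggregated; let the caller decide whether to skip the update.
            return -1L;
        }
        return Long.parseLong(value);
    }
}
```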



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
<http://review.cloudera.org/r/862/#comment4295>

    s/retur/return/



trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java
<http://review.cloudera.org/r/862/#comment4274>

    typo:  MapRedTaks



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java
<http://review.cloudera.org/r/862/#comment4278>

    Some more overview (or just link to updated wiki doc) would be good here since the methods below reference things like temporary stats and aggregation without really explaining them.
    
    Also:  I think having the publisher/aggregator implementations catch errors themselves is confusing.  It would be cleaner to let them propagate the exceptions, and instead catch+suppress+warn in the calling code (under control of a strictness config param).
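
One way to read this suggestion is sketched below: the publisher declares that it may throw, and a single call site decides whether a failure is fatal. The interface, the method names, and the `strict` flag are hypothetical; the actual config parameter is not named in this thread.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class StrictnessDemo {
    private static final Logger LOG = Logger.getLogger(StrictnessDemo.class.getName());

    // Hypothetical contract: implementations propagate failures instead of swallowing them.
    interface Publisher {
        void publish(String key, long rows) throws Exception;
    }

    // Returns true if the stat was published. With strict == false a failure is logged
    // as a warning and suppressed; with strict == true it is rethrown to fail the caller.
    static boolean publishStats(Publisher p, String key, long rows, boolean strict) {
        try {
            p.publish(key, rows);
            return true;
        } catch (Exception e) {
            if (strict) {
                throw new IllegalStateException("Stats publishing failed for " + key, e);
            }
            LOG.log(Level.WARNING, "Stats publishing failed for " + key + "; continuing", e);
            return false;
        }
    }
}
```

Keeping the catch in one place means every publisher and aggregator implementation behaves the same way under the same setting.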
    



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java
<http://review.cloudera.org/r/862/#comment4279>

    @param, @return?



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java
<http://review.cloudera.org/r/862/#comment4277>

    Use correct Javadoc @param syntax, and add @return.



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java
<http://review.cloudera.org/r/862/#comment4280>

    Use @param, @return



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsFactory.java
<http://review.cloudera.org/r/862/#comment4282>

    For other plugin-loading code, we use JavaUtils.getClassLoader().  Should probably do the same here?
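
The general pattern behind this comment is to resolve plugin classes through an explicitly chosen class loader rather than whatever loader the calling class happens to have. The sketch below shows that pattern with plain JDK calls; it assumes the helper prefers the thread context class loader, which may not match `JavaUtils.getClassLoader()` exactly, and the `PluginLoader` class is hypothetical.

```java
final class PluginLoader {
    // Prefer the thread context class loader (which can see jars added at runtime)
    // and fall back to the loader of this class. Assumed behavior, not Hive's code.
    static ClassLoader pluginClassLoader() {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        return cl != null ? cl : PluginLoader.class.getClassLoader();
    }

    // Instantiates className through the chosen loader and checks it against type.
    static <T> T newInstance(String className, Class<T> type) {
        try {
            Class<?> raw = Class.forName(className, true, pluginClassLoader());
            return type.cast(raw.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load plugin class " + className, e);
        }
    }
}
```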
    



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsFactory.java
<http://review.cloudera.org/r/862/#comment4281>

    Don't use printStackTrace; log the exception instead.



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java
<http://review.cloudera.org/r/862/#comment4283>

    See comments on StatsAggregator regarding Javadoc.  Also, s/statics/statistics/



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java
<http://review.cloudera.org/r/862/#comment4284>

    I don't think this warrants four exclamation marks.



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<http://review.cloudera.org/r/862/#comment4288>

    Is it worth using a prepared statement here?  
    
    Also, depending on the transaction isolation level, concurrent update attempts could result in spurious error logging, although the final result should still be valid.
    
    Using a positioned update can give more control over this (and allow the number of SQL statements executed to be reduced), but not all JDBC drivers support it.
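
For reference, a prepared-statement version of a stats update might look like the sketch below. The table and column names are invented for illustration and do not reflect the schema in the patch. The statement text is fixed, so only the bound values change between calls.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class StatsUpdateSketch {
    // Hypothetical table and column names; the real schema is defined by the patch.
    static final String UPDATE_SQL =
        "UPDATE PARTITION_STATS SET ROW_COUNT = ? WHERE ID = ?";

    // Binds the values as parameters instead of concatenating them into the SQL text.
    // Returns the number of rows the driver reports as updated.
    static int updateRowCount(Connection conn, String id, long rowCount) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(UPDATE_SQL)) {
            ps.setLong(1, rowCount);
            ps.setString(2, id);
            return ps.executeUpdate();
        }
    }
}
```

Whether this pays off depends on how often the statement runs per connection; it does not by itself address the concurrency concern raised above.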
    



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<http://review.cloudera.org/r/862/#comment4286>

    This is superfluous.



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<http://review.cloudera.org/r/862/#comment4292>

    I think this actually means "explicitly shut down the database" (which is different from closing the connection).
    


- John





> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch

[jira] Updated: (HIVE-1361) table/partition level statistics

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1361:
-----------------------------

    Attachment: HIVE-1361.5.java_only.patch
                HIVE-1361.5.patch

Uploading a new set of patches that resolve the conflicts with the latest commits.

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1361.2.patch, HIVE-1361.2_java_only.patch, HIVE-1361.3.patch, HIVE-1361.4.java_only.patch, HIVE-1361.4.patch, HIVE-1361.5.java_only.patch, HIVE-1361.5.patch, HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch

[jira] Commented: (HIVE-1361) table/partition level statistics

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910851#action_12910851 ] 

HBase Review Board commented on HIVE-1361:
------------------------------------------

Message from: "namit jain" <nj...@facebook.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/862/#review1264
-----------------------------------------------------------



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java
<http://review.cloudera.org/r/862/#comment4293>

    This code seems useless



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java
<http://review.cloudera.org/r/862/#comment4296>

    How are you accounting for speculative execution?
    
    Can two tasks insert the entry?
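
One possible answer, sketched under the assumption that each published entry is keyed by the Hadoop task attempt ID: strip the trailing attempt number so that two attempts of the same task map to the same key and the later write replaces the earlier one instead of being counted twice. This is an illustration of the idea, not what the patch does, and the IDs in the usage note are made up.

```java
import java.util.HashMap;
import java.util.Map;

final class SpeculativeDedup {
    // Drops the trailing attempt number:
    // "attempt_..._m_000003_1" becomes "attempt_..._m_000003".
    static String taskKey(String attemptId) {
        int i = attemptId.lastIndexOf('_');
        return i < 0 ? attemptId : attemptId.substring(0, i);
    }

    // Keeps one row count per task; a later attempt of the same task overwrites
    // the earlier one, so speculative duplicates are not double-counted.
    static long totalRows(String[] attemptIds, long[] rowCounts) {
        Map<String, Long> perTask = new HashMap<>();
        for (int i = 0; i < attemptIds.length; i++) {
            perTask.put(taskKey(attemptIds[i]), rowCounts[i]);
        }
        long total = 0;
        for (long v : perTask.values()) {
            total += v;
        }
        return total;
    }
}
```

For example, attempts `..._m_000003_0` and `..._m_000003_1` reporting 100 rows each, plus `..._m_000004_0` reporting 50, yield a total of 150 rather than 250.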



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
<http://review.cloudera.org/r/862/#comment4305>

    It might be a good idea to make it easy to add new stats. Right now, you will need to fix code in multiple places.
    
    Instead of hard-coding nRowsInTable, it would be good to keep an array of the stats we are publishing in a central place.
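
A sketch of what such a central place could look like, assuming an enum is an acceptable substitute for the array mentioned above. The enum constants mirror stats proposed in the issue description; the class and method names are hypothetical.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class StatsRegistry {
    // Hypothetical central list: supporting a new statistic means adding one constant here.
    enum StatType { ROW_COUNT, TOTAL_SIZE, NUM_FILES }

    // Sums every registered stat across the per-task maps without naming any stat
    // explicitly, so this code does not change when StatType grows.
    static Map<StatType, Long> aggregate(List<Map<StatType, Long>> perTask) {
        Map<StatType, Long> totals = new EnumMap<>(StatType.class);
        for (StatType t : StatType.values()) {
            long sum = 0;
            for (Map<StatType, Long> m : perTask) {
                sum += m.getOrDefault(t, 0L);
            }
            totals.put(t, sum);
        }
        return totals;
    }
}
```

Publishing and aggregation code can then iterate over `StatType.values()` instead of referring to a single hard-coded counter.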



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
<http://review.cloudera.org/r/862/#comment4306>

    This (addOutputs()) should be done at compile time.



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java
<http://review.cloudera.org/r/862/#comment4310>

    Most of these parameters need not be instance variables. Add a new function and define them there.



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java
<http://review.cloudera.org/r/862/#comment4307>

    Can you add publishStats in Utilities and let TableScan and FileSink share it?



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
<http://review.cloudera.org/r/862/#comment4309>

    I am assuming these red blocks mean TABs



trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
<http://review.cloudera.org/r/862/#comment4311>

    ??



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<http://review.cloudera.org/r/862/#comment4314>

    Do we need to lock the row? Use a SELECT FOR UPDATE instead of a plain SELECT.


- namit






[jira] Updated: (HIVE-1361) table/partition level statistics

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1361:
-----------------------------

    Status: Patch Available  (was: Open)


[jira] Commented: (HIVE-1361) table/partition level statistics

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910470#action_12910470 ] 

Namit Jain commented on HIVE-1361:
----------------------------------

Will take a look 


[jira] Updated: (HIVE-1361) table/partition level statistics

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1361:
-----------------------------

    Status: Patch Available  (was: Open)

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1361.2.patch, HIVE-1361.2_java_only.patch, HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch
>
>

[jira] Commented: (HIVE-1361) table/partition level statistics

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914728#action_12914728 ] 

He Yongqiang commented on HIVE-1361:
------------------------------------

+1 running tests.

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1361.2.patch, HIVE-1361.2_java_only.patch, HIVE-1361.3.patch, HIVE-1361.4.java_only.patch, HIVE-1361.4.patch, HIVE-1361.5.java_only.patch, HIVE-1361.5.patch, HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch
>
>

[jira] Commented: (HIVE-1361) table/partition level statistics

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881448#action_12881448 ] 

Ning Zhang commented on HIVE-1361:
----------------------------------

Some comments from internal design review:
 - The ANALYZE TABLE command should be integrated with the data replication hook. When an existing table/partition is analyzed, a new WriteEntity should be generated to make metadata replication work. 
 - Investigate JDO on top of HBase integration. If JDO works on HBase, we could just use JDO to update column stats as well. 
 - ANALYZE TABLE partition (<partition_spec>) should support "dynamic-partition-style" partition specification. This means that if there are 2 partition columns ds, hr, we can do analyze table partition(ds = '2010-06-01', hr) to analyze all hr sub-partitions under ds='2010-06-01'. 
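
To make the "dynamic-partition-style" specification concrete, here is a minimal sketch in Java of how a partial spec could select the partitions to analyze. The class and method names (PartialSpecMatcher, matches, select, part) are hypothetical and are not taken from the Hive code base. A partition column listed without a value is modeled as a null entry, meaning "all values".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Hive's actual implementation.
class PartialSpecMatcher {

    // A spec maps partition column -> value. A null value means the column was
    // listed without a value, as in PARTITION (ds='2010-06-01', hr).
    static boolean matches(Map<String, String> spec, Map<String, String> partition) {
        for (Map.Entry<String, String> e : spec.entrySet()) {
            String wanted = e.getValue();
            if (wanted != null && !wanted.equals(partition.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    // Return every partition that satisfies the (possibly partial) spec.
    static List<Map<String, String>> select(Map<String, String> spec,
                                            List<Map<String, String>> partitions) {
        List<Map<String, String>> selected = new ArrayList<>();
        for (Map<String, String> p : partitions) {
            if (matches(spec, p)) {
                selected.add(p);
            }
        }
        return selected;
    }

    // Helper for the two partition columns used in the comment above (ds, hr).
    static Map<String, String> part(String ds, String hr) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("ds", ds);
        m.put("hr", hr);
        return m;
    }

    public static void main(String[] args) {
        List<Map<String, String>> all = List.of(
            part("2010-06-01", "00"),
            part("2010-06-01", "01"),
            part("2010-06-02", "00"));
        // Partial spec: ds is fixed, hr is left open.
        System.out.println(select(part("2010-06-01", null), all));
    }
}
```

With ds fixed and hr left open, only the partitions under ds='2010-06-01' are selected. A full spec (both values given) selects at most one partition.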



> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>    Affects Versions: 0.6.0
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>

[jira] Updated: (HIVE-1361) table/partition level statistics

Posted by "Ahmed M Aly (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmed M Aly updated HIVE-1361:
------------------------------

    Attachment: stats0.patch

This is an initial patch for quick review. This is not final as there are still some minor changes to be done.

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>    Affects Versions: 0.6.0
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>         Attachments: stats0.patch
>
>

[jira] Updated: (HIVE-1361) table/partition level statistics

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1361:
-----------------------------

    Attachment: HIVE-1361.4.java_only.patch
                HIVE-1361.4.patch

Uploading a new patch refreshed against the latest trunk. Also added a negative test case (analyze.q) and did some trivial cleanup in the Java code (removing commented-out content). 

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1361.2.patch, HIVE-1361.2_java_only.patch, HIVE-1361.3.patch, HIVE-1361.4.java_only.patch, HIVE-1361.4.patch, HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch
>
>

[jira] Updated: (HIVE-1361) table/partition level statistics

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1361:
-----------------------------

    Attachment: HIVE-1361.2.patch
                HIVE-1361.2_java_only.patch

Uploading a new patch for review (a full version and a Java_only version that also includes the XML build files). This is against the latest trunk.

The major changes from the last patch include: 
  1) Make JDBC update/insert/select use PreparedStatement. 
  2) In HBase, use HTable.delete(ArrayList<Delete>) to speed up deletes, and flushCommits() to batch updates. 
  3) Refactor StatsTask to put stats into PartitionStatistics and TableStatistics so that it is easier to add new stats later. 
  4) Move WriteEntity creation from StatsTask to compile-time.
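
As a rough illustration of change 3), the sketch below shows one way separate partition-level and table-level containers can make it easy to add new stats later. These are hypothetical stand-ins written for this note (StatsContainersSketch, PartitionStats, TableStats); the actual PartitionStatistics and TableStatistics classes in the patch may look quite different. It assumes a partitioned table whose table-level numbers are roll-ups of its partitions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical containers for illustration; the classes in the patch may differ.
class StatsContainersSketch {

    // Stats kept per partition.
    static class PartitionStats {
        final long numRows;
        final long numFiles;
        final long totalSize; // in bytes

        PartitionStats(long numRows, long numFiles, long totalSize) {
            this.numRows = numRows;
            this.numFiles = numFiles;
            this.totalSize = totalSize;
        }
    }

    // Table-level stats: the partition-level stats rolled up, plus numPartitions.
    static class TableStats {
        private final List<PartitionStats> partitions = new ArrayList<>();

        void addPartition(PartitionStats p) {
            partitions.add(p);
        }

        int numPartitions() {
            return partitions.size();
        }

        long numRows() {
            long sum = 0;
            for (PartitionStats p : partitions) sum += p.numRows;
            return sum;
        }

        long numFiles() {
            long sum = 0;
            for (PartitionStats p : partitions) sum += p.numFiles;
            return sum;
        }

        long totalSize() {
            long sum = 0;
            for (PartitionStats p : partitions) sum += p.totalSize;
            return sum;
        }
    }

    public static void main(String[] args) {
        TableStats t = new TableStats();
        t.addPartition(new PartitionStats(100, 2, 4096));
        t.addPartition(new PartitionStats(50, 1, 1024));
        System.out.println(t.numPartitions() + " partitions, " + t.numRows() + " rows");
    }
}
```

In this layout a new stat needs one field in the partition container and one roll-up method in the table container, without touching the task that fills them in.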

I'm running the tests again after refreshing to the latest trunk.

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1361.2.patch, HIVE-1361.2_java_only.patch, HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch
>
>

[jira] Commented: (HIVE-1361) table/partition level statistics

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896634#action_12896634 ] 

Ning Zhang commented on HIVE-1361:
----------------------------------

Ahmed has put up the design doc on the wiki: http://wiki.apache.org/hadoop/Hive/StatsDev.

Ahmed is also finalizing the patch for review. 

There are some minor changes from the original requirement: the stats currently gathered are the number of rows, total size in bytes, number of files, and number of partitions (for a table). The min/max/avg of row/file sizes are not included, because the raw sizes (serialized and compressed) differ from the sizes we see during stats gathering (deserialized and decompressed). There are also no strong use cases for them currently, so we'll exclude them from this patch. 

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>    Affects Versions: 0.6.0
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>

[jira] Updated: (HIVE-1361) table/partition level statistics

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-1361:
-----------------------------

    Status: Open  (was: Patch Available)

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch
>
>

[jira] Updated: (HIVE-1361) table/partition level statistics

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1361:
-------------------------------

        Status: Resolved  (was: Patch Available)
    Resolution: Fixed

I just committed! Thanks Ning and Ahmed!

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1361.2.patch, HIVE-1361.2_java_only.patch, HIVE-1361.3.patch, HIVE-1361.4.java_only.patch, HIVE-1361.4.patch, HIVE-1361.5.java_only.patch, HIVE-1361.5.patch, HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch
>
>

[jira] Updated: (HIVE-1361) table/partition level statistics

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1361:
-----------------------------

    Attachment: HIVE-1361.patch
                HIVE-1361.java_only.patch

Uploading a full version (HIVE-1361.patch) and a Java code only version (HIVE-1361.java_only.patch). 

This patch is based on Ahmed's previous patch and implements the following features:
  1) Automatically gather stats (currently the number of rows) whenever an INSERT OVERWRITE TABLE is issued. Each mapper/reducer pushes its partial stats either to MySQL/Derby through JDBC or to HBase. The INSERT OVERWRITE statement can be of any kind, including dynamic partition inserts, multi-table inserts, and inserts into bucketized partitions. A StatsTask is responsible for aggregating the partial stats at the end of the query and updating the metastore.
  2) The stats of a table/partition are exposed to the user through 'DESC EXTENDED' on the table/partition. They are stored as storage parameters (numRows, numFiles, numPartitions). 
  3) Introduce a new command 'ANALYZE TABLE [PARTITION (PARTITION SPEC)] COMPUTE STATISTICS' that scans the table/partition and gathers stats in a similar fashion to the INSERT OVERWRITE command, except that the plan has only 1 MR job consisting of a TableScanOperator and a StatsTask. The partition spec can be a full partition spec or a partial partition spec similar to what dynamic partition insert uses. This allows the user to analyze a subset of, or all, partitions of a table. The resulting stats are stored in the same parameters in the metastore.
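
The publish-then-aggregate flow in 1) can be sketched roughly as follows. This is a simplified, hypothetical model written for this note: an in-memory map stands in for the JDBC or HBase store, and the names (PartialStatsSketch, publish, aggregate) and the key layout are assumptions, not the patch's actual classes or schema.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an in-memory map stands in for the JDBC/HBase stats store.
class PartialStatsSketch {

    // partition key (e.g. "tbl/ds=2010-06-01/hr=00") -> task id -> partial row count
    private final Map<String, Map<String, Long>> store = new HashMap<>();

    // Called by each mapper/reducer when it finishes: publish its partial row count.
    // Keying by task id (an assumption of this sketch) means a retried task
    // overwrites its earlier value instead of being counted twice.
    void publish(String partitionKey, String taskId, long rows) {
        store.computeIfAbsent(partitionKey, k -> new HashMap<>()).put(taskId, rows);
    }

    // Called once at the end of the query (the StatsTask role): sum the partials.
    long aggregate(String partitionKey) {
        Map<String, Long> partials = store.get(partitionKey);
        if (partials == null) {
            return 0L;
        }
        long total = 0L;
        for (long rows : partials.values()) {
            total += rows;
        }
        return total;
    }

    public static void main(String[] args) {
        PartialStatsSketch stats = new PartialStatsSketch();
        stats.publish("tbl/ds=2010-06-01/hr=00", "task_0", 120);
        stats.publish("tbl/ds=2010-06-01/hr=00", "task_1", 80);
        // The aggregated value is what would be written to the metastore as numRows.
        System.out.println(stats.aggregate("tbl/ds=2010-06-01/hr=00"));
    }
}
```

The same aggregation step would serve both INSERT OVERWRITE and ANALYZE TABLE, since in both cases the tasks only publish partial counts and a single final step combines them.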

Tested locally (unit tests) for JDBC:derby, hbase and on a cluster with JDBC:MySQL. 

Will run the full unit tests again. 

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>    Affects Versions: 0.6.0
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>         Attachments: HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch
>
>

[jira] Commented: (HIVE-1361) table/partition level statistics

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870240#action_12870240 ] 

Namit Jain commented on HIVE-1361:
----------------------------------

Can we further break them down into the stats that will be collected automatically at insert/load time vs. the stats that will be collected when the user explicitly analyzes the table?

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>    Affects Versions: 0.6.0
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>             Fix For: 0.6.0
>
>

[jira] Commented: (HIVE-1361) table/partition level statistics

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913783#action_12913783 ] 

Namit Jain commented on HIVE-1361:
----------------------------------

Ning, the latest patch contains the output of svn stat

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1361.2.patch, HIVE-1361.2_java_only.patch, HIVE-1361.3.patch, HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch
>
>

[jira] Updated: (HIVE-1361) table/partition level statistics

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-1361:
---------------------------------

        Fix Version/s: 0.7.0
    Affects Version/s:     (was: 0.6.0)
          Component/s: Query Processor

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch
>
>

[jira] Updated: (HIVE-1361) table/partition level statistics

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1361:
-----------------------------

    Attachment: HIVE-1361.3.patch

Updated HIVE-1361.3.patch.

> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1361.2.patch, HIVE-1361.2_java_only.patch, HIVE-1361.3.patch, HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch
>
>