Posted to dev@hive.apache.org by "Tomasz Nykiel (JIRA)" <ji...@apache.org> on 2011/05/26 00:27:47 UTC

[jira] [Created] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
----------------------------------------------------------------------------------------------------------------------------

                 Key: HIVE-2185
                 URL: https://issues.apache.org/jira/browse/HIVE-2185
             Project: Hive
          Issue Type: New Feature
          Components: Serializers/Deserializers, Statistics
            Reporter: Tomasz Nykiel
            Assignee: Tomasz Nykiel


Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. 

Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
Currently, a large part of the statistics collection mechanism is hardcoded and not easily extensible to other statistics.
On top of adding the new statistic, it would be desirable to extend the collection mechanism so that any new statistics can be added easily.
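
As a rough illustration of why the uncompressed size is useful, the sketch below derives a compression ratio from the two sizes. The class and method names are hypothetical and not part of Hive; the stored size is assumed to come from the file system and the uncompressed size from the new statistic.

```java
// Hypothetical helper, not part of Hive: compares the raw (uncompressed) size
// of a table or partition with its size on disk.
final class CompressionEfficiency {

  private CompressionEfficiency() {
  }

  /**
   * Returns uncompressedBytes / storedBytes, or 0.0 when either size is
   * unknown or not positive (an assumption made for this sketch).
   */
  static double ratio(long uncompressedBytes, long storedBytes) {
    if (uncompressedBytes <= 0 || storedBytes <= 0) {
      return 0.0;
    }
    return (double) uncompressedBytes / (double) storedBytes;
  }

  public static void main(String[] args) {
    // 4000 raw bytes stored in 1000 bytes on disk.
    System.out.println(ratio(4000L, 1000L)); // prints 4.0
  }
}
```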

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039928#comment-13039928 ] 

jiraposter@reviews.apache.org commented on HIVE-2185:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/#review719
-----------------------------------------------------------



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java
<https://reviews.apache.org/r/785/#comment1458>

    Shouldn't isValidStatics() take "key" as a parameter rather than "rowID"? "key" should indicate which statistic this is, right?



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java
<https://reviews.apache.org/r/785/#comment1459>

    should be >= here


- Ning


On 2011-05-26 02:52:55, Tomasz Nykiel wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/785/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-05-26 02:52:55)
bq.  
bq.  
bq.  Review request for hive.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. 
bq.  Other statistics (e.g., total table/partition size) are derived from the file system.
bq.  
bq.  We introduce a new feature for collecting information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
bq.  On top of adding the new statistic collected, this patch extends the stats collection mechanism, so any new statistics could be added easily.
bq.  
bq.  1. serializer/deserializer classes are amended to accommodate collecting sizes of uncompressed data, when serializing/deserializing objects.
bq.  We support:
bq.  
bq.  Columnar SerDe
bq.  LazySimpleSerDe
bq.  LazyBinarySerDe
bq.  
bq.  For other SerDe classes the uncompressed size will be 0.
bq.  
bq.  2. StatsPublisher / StatsAggregator interfaces are extended to support multi-stats collection for both JDBC and HBase.
bq.  
bq.  3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and TableScanOperator respectively are extended to support multi-stats collection.
bq.  
bq.  (2) and (3) enable easy extension for other types of statistics.
bq.  
bq.  4. Collecting uncompressed size can be disabled by setting:
bq.  
bq.  hive.stats.collect.uncompressedsize = false
bq.  
bq.  
bq.  This addresses bug HIVE-2185.
bq.      https://issues.apache.org/jira/browse/HIVE-2185
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1127756 
bq.    trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1127756 
bq.    trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1127756 
bq.    trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1127756 
bq.    trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1127756 
bq.    trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1127756 
bq.    trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1127756 
bq.    trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1127756 
bq.    trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION 
bq.    trunk/hbase-handler/src/test/queries/hbase_stats.q 1127756 
bq.    trunk/hbase-handler/src/test/results/hbase_stats.q.out 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1127756 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1127756 
bq.    trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/combine2.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/merge3.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/merge4.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/pcr.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/sample10.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/stats11.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/stats14.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/stats15.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/union22.q.out 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/Deserializer.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/MetadataTypedColumnsetSerDe.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java PRE-CREATION 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStatsStruct.java PRE-CREATION 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/Serializer.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/TypedSerDe.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/dynamic_type/DynamicSerDe.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyStruct.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryStruct.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/thrift/ThriftDeserializer.java 1127756 
bq.    trunk/serde/src/test/org/apache/hadoop/hive/serde2/TestStatsSerde.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/785/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  - additional JUnit test for Serializer/Deserializer amended classes
bq.  - additional queries for TestCliDriver over multi-partition tables
bq.  - all other JUnit tests
bq.  - standalone setup 
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Tomasz
bq.  
bq.
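
A minimal sketch of the multi-stats idea in points 2 and 3 of the quoted summary, assuming statistics are published per row as a map from statistic name to value and aggregated by key prefix. The class, the method signatures, and the statistic names "numRows" and "rawDataSize" are hypothetical; they are not the actual StatsPublisher / StatsAggregator interfaces from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a stats store (JDBC or HBase in the patch).
// Names and signatures are illustrative only.
final class InMemoryStatsStore {

  // row id (for example a partition prefix plus a task id) -> (statistic name -> value)
  private final Map<String, Map<String, Long>> rows =
      new HashMap<String, Map<String, Long>>();

  /** Publishes several named statistics for one row in a single call. */
  void publish(String rowId, Map<String, Long> stats) {
    rows.put(rowId, new HashMap<String, Long>(stats));
  }

  /** Sums one named statistic over all rows whose id starts with keyPrefix. */
  long aggregate(String keyPrefix, String statName) {
    long total = 0;
    for (Map.Entry<String, Map<String, Long>> row : rows.entrySet()) {
      if (row.getKey().startsWith(keyPrefix)) {
        Long value = row.getValue().get(statName);
        if (value != null) {
          total += value.longValue();
        }
      }
    }
    return total;
  }

  public static void main(String[] args) {
    InMemoryStatsStore store = new InMemoryStatsStore();

    Map<String, Long> task0 = new HashMap<String, Long>();
    task0.put("numRows", 10L);
    task0.put("rawDataSize", 2048L);
    store.publish("ds=2011-05-26/task0", task0);

    Map<String, Long> task1 = new HashMap<String, Long>();
    task1.put("numRows", 5L);
    task1.put("rawDataSize", 1024L);
    store.publish("ds=2011-05-26/task1", task1);

    System.out.println(store.aggregate("ds=2011-05-26/", "numRows"));
    System.out.println(store.aggregate("ds=2011-05-26/", "rawDataSize"));
  }
}
```

With this shape, adding another statistic only means adding another map entry on the publishing side; the store itself does not change.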



[jira] [Commented] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044142#comment-13044142 ] 

Hudson commented on HIVE-2185:
------------------------------

Integrated in Hive-trunk-h0.21 #759 (See [https://builds.apache.org/hudson/job/Hive-trunk-h0.21/759/])
    HIVE-2185. extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics) (Tomasz Nykiel via Ning Zhang)

nzhang : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1131106
Files : 
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/Serializer.java
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/thrift/ThriftDeserializer.java
* /hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java
* /hive/trunk/serde/src/test/org/apache/hadoop/hive/serde2/TestStatsSerde.java
* /hive/trunk/hbase-handler/src/test/results/hbase_stats.q.out
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStatsStruct.java
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java
* /hive/trunk/hbase-handler/src/test/results/hbase_stats2.q.out
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java
* /hive/trunk/hbase-handler/src/test/queries/hbase_stats.q
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/test/results/clientpositive/merge4.q.out
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyStruct.java
* /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java
* /hive/trunk/ql/src/test/queries/clientpositive/stats15.q
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/dynamic_type/DynamicSerDe.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java
* /hive/trunk/ql/src/test/results/clientpositive/stats14.q.out
* /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out
* /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
* /hive/trunk/ql/src/test/results/clientpositive/union22.q.out
* /hive/trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/hbase-handler/src/test/queries/hbase_stats2.q
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java
* /hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java
* /hive/trunk/ql/src/test/results/clientpositive/pcr.q.out
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/MetadataTypedColumnsetSerDe.java
* /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/Deserializer.java
* /hive/trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out
* /hive/trunk/ql/src/test/results/clientpositive/merge3.q.out
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryStruct.java
* /hive/trunk/ql/src/test/queries/clientpositive/stats14.q
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java
* /hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java
* /hive/trunk/ql/src/test/results/clientpositive/sample10.q.out
* /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java
* /hive/trunk/ql/src/test/results/clientpositive/stats11.q.out
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java
* /hive/trunk/ql/src/test/results/clientpositive/stats15.q.out
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/TypedSerDe.java
* /hive/trunk/ql/src/test/results/clientpositive/combine2.q.out


> extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2185
>                 URL: https://issues.apache.org/jira/browse/HIVE-2185
>             Project: Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers, Statistics
>            Reporter: Tomasz Nykiel
>            Assignee: Tomasz Nykiel
>             Fix For: 0.8.0
>
>         Attachments: HIVE-2185.1.patch, HIVE-2185.2.patch, HIVE-2185.patch
>
>
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. 
> Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
> Currently, a large part of statistics collection mechanism is hardcoded and not-easily extensible for other statistics.
> On top of adding the new statistic collected, it would be desirable to extend the collection mechanism, so any new statistics could be added easily.

[jira] [Commented] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043236#comment-13043236 ] 

Ning Zhang commented on HIVE-2185:
----------------------------------

+1. Will commit if tests pass.

[jira] [Commented] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039813#comment-13039813 ] 

jiraposter@reviews.apache.org commented on HIVE-2185:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/#review718
-----------------------------------------------------------



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java
<https://reviews.apache.org/r/785/#comment1457>

    should be:
    
    long current = 0;
    SerDeStats st = this.deserializer.getSerDeStats();
    if(st != null) {
      current = st.getUncompressedSize();
    }
    
    since we do not explicitly check which SerDe class is in use, and some of the unsupported classes return null


- Tomasz


[jira] [Commented] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041897#comment-13041897 ] 

jiraposter@reviews.apache.org commented on HIVE-2185:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/#review725
-----------------------------------------------------------



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java
<https://reviews.apache.org/r/785/#comment1468>

    For better debugging info, print out the key and the valid stats keys.



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java
<https://reviews.apache.org/r/785/#comment1492>

    currentValue + amount will result in object creation. This is very expensive in this case, since this function is called for every input row. Instead of the immutable class Long, LongWritable may be a better choice.
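
To illustrate the point about per-row allocation, here is a minimal sketch. It does not reproduce the Stat.java under review: MutableLong is a hypothetical stand-in for Hadoop's LongWritable so that the example compiles with the JDK alone, and the class and method names are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class StatCounterSketch {
    /** Hypothetical mutable holder, standing in for Hadoop's LongWritable. */
    static final class MutableLong {
        private long value;

        void add(long amount) {
            value += amount;
        }

        long get() {
            return value;
        }
    }

    private final Map<String, MutableLong> stats = new HashMap<>();

    /** Adds amount to the named statistic by mutating the holder in place. */
    public void addToStat(String name, long amount) {
        MutableLong counter = stats.get(name);
        if (counter == null) {
            // Allocation happens once per statistic, not once per row.
            counter = new MutableLong();
            stats.put(name, counter);
        }
        counter.add(amount);
    }

    /** Returns the current value, or 0 if the statistic was never updated. */
    public long getStat(String name) {
        MutableLong counter = stats.get(name);
        return counter == null ? 0L : counter.get();
    }

    public static void main(String[] args) {
        StatCounterSketch sketch = new StatCounterSketch();
        for (int row = 0; row < 1000; row++) {
            sketch.addToStat("rawDataSize", 42L);
        }
        System.out.println(sketch.getStat("rawDataSize")); // prints 42000
    }
}
```

With a Map<String, Long>, each update would unbox the old value, add to it, and box the result again, which can allocate a new Long on every row. The mutable holder avoids that.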



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
<https://reviews.apache.org/r/785/#comment1501>

    Also consider using LongWritable rather than Long. 



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
<https://reviews.apache.org/r/785/#comment1502>

    LongWritable. 



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
<https://reviews.apache.org/r/785/#comment1503>

    Can you print the stack trace to LOG rather than to console?



trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java
<https://reviews.apache.org/r/785/#comment1504>

    The declaration should use the interface List rather than the implementation ArrayList.



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java
<https://reviews.apache.org/r/785/#comment1506>

    Better to use Map rather than HashMap in the declaration.
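
The two declaration comments (List over ArrayList, Map over HashMap) follow the same idea, sketched below. The class and field names are hypothetical and are not taken from VirtualColumn.java or JDBCStatsAggregator.java.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InterfaceDeclarationSketch {
    // Declared against the interfaces; only the right-hand side names an implementation,
    // so the implementation can be swapped without touching callers.
    private final List<String> columnNames = new ArrayList<>();
    private final Map<String, Long> statValues = new HashMap<>();

    public void record(String column, long value) {
        columnNames.add(column);
        statValues.put(column, value);
    }

    public List<String> getColumnNames() {
        return columnNames;
    }

    public Map<String, Long> getStatValues() {
        return statValues;
    }
}
```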



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java
<https://reviews.apache.org/r/785/#comment1507>

    Can you change it to use Utilities.executeWithRetry() as well?
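
The signature of Utilities.executeWithRetry() is not shown in this thread, so the following is only a generic sketch of the retry pattern the comment asks for. RetrySketch, callWithRetry, and its parameters are hypothetical names, not Hive API.

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    /**
     * Hypothetical helper: runs the action and, if it throws, retries up to
     * maxRetries more times, waiting a little longer before each retry.
     */
    public static <T> T callWithRetry(Callable<T> action, int maxRetries, long baseWaitMs)
            throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxRetries) {
                    // Linear back-off between attempts.
                    Thread.sleep(baseWaitMs * (attempt + 1));
                }
            }
        }
        // All attempts failed; surface the last failure to the caller.
        throw last;
    }
}
```

A JDBC query that fails on a transient connection error could be wrapped in such a call, so that the aggregator retries instead of failing on the first error.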



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java
<https://reviews.apache.org/r/785/#comment1508>

    Let's also put the comment here as in other statements. 



trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out
<https://reviews.apache.org/r/785/#comment1509>

    The uncompressed size is smaller than the totalSize. Can you double-check whether this is because of the overhead (headers etc.) in the file format or because of a bug in the stats?



trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java
<https://reviews.apache.org/r/785/#comment1510>

    Please add some comments here on what it is used for.



trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java
<https://reviews.apache.org/r/785/#comment1511>

    Would it make sense to add the size of field delimiters as well? And if we know the record delimiters (for most record readers it is a newline), we can add that too. This will make the stats more accurately reflect the real uncompressed size stored in the file.
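
A rough sketch of the suggested accounting for a text row is given below. It assumes single-byte field and record delimiters and treats null fields as zero bytes; both are simplifications for illustration, and rawRowSize is a hypothetical helper, not code from ColumnarStruct.java.

```java
import java.nio.charset.StandardCharsets;

public class RawRowSizeSketch {
    /**
     * Estimates the uncompressed size in bytes of one text row: the UTF-8 bytes of
     * each field, one single-byte delimiter between adjacent fields, and one
     * single-byte record terminator. Null fields count as zero bytes in this sketch.
     */
    public static long rawRowSize(String[] fields) {
        long size = 0;
        for (String field : fields) {
            if (field != null) {
                size += field.getBytes(StandardCharsets.UTF_8).length;
            }
        }
        long fieldDelimiters = Math.max(0, fields.length - 1);
        long recordDelimiter = 1;
        return size + fieldDelimiters + recordDelimiter;
    }
}
```

For the row {"ab", "cde"} this gives 5 data bytes, 1 field delimiter, and 1 record delimiter, so 7 bytes in total, compared with 5 when delimiters are ignored.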



trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java
<https://reviews.apache.org/r/785/#comment1512>

    It may be simpler to just use == rather than ! and ^. Also, consider an assert rather than returning null?
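
The reviewed line in LazySimpleSerDe.java is not quoted in this thread, so the snippet below only illustrates the boolean identity behind the suggestion. Both method names are hypothetical.

```java
public class BooleanEqualitySketch {
    /** True when both flags agree, written with negation and XOR. */
    public static boolean agreeWithXor(boolean a, boolean b) {
        return !(a ^ b);
    }

    /** Equivalent for every pair of inputs, and easier to read. */
    public static boolean agreeWithEquals(boolean a, boolean b) {
        return a == b;
    }
}
```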



trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java
<https://reviews.apache.org/r/785/#comment1513>

    same as above.


- Ning


On 2011-05-26 21:27:34, Tomasz Nykiel wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/785/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-05-26 21:27:34)
bq.  
bq.  
bq.  Review request for hive.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. 
bq.  Other statistics (e.g., total table/partition size) are derived from the file system.
bq.  
bq.  We introduce a new feature for collecting information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
bq.  On top of adding the new statistic collected, this patch extends the stats collection mechanism, so any new statistics could be added easily.
bq.  
bq.  1. serializer/deserializer classes are amended to accommodate collecting sizes of uncompressed data, when serializing/deserializing objects.
bq.  We support:
bq.  
bq.  Columnar SerDe
bq.  LazySimpleSerDe
bq.  LazyBinarySerDe
bq.  
bq.  For other SerDe classes the uncompressed size will be 0.
bq.  
bq.  2. StatsPublisher / StatsAggregator interfaces are extended to support multi-stats collection for both JDBC and HBase.
bq.  
bq.  3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and TableScanOperator respectively are extended to support multi-stats collection.
bq.  
bq.  (2) and (3) enable easy extension for other types of statistics.
bq.  
bq.  4. Collecting uncompressed size can be disabled by setting:
bq.  
bq.  hive.stats.collect.uncompressedsize = false
bq.  
bq.  
bq.  This addresses bug HIVE-2185.
bq.      https://issues.apache.org/jira/browse/HIVE-2185
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1128070 
bq.    trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1128070 
bq.    trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1128070 
bq.    trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1128070 
bq.    trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1128070 
bq.    trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1128070 
bq.    trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1128070 
bq.    trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1128070 
bq.    trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION 
bq.    trunk/hbase-handler/src/test/queries/hbase_stats.q 1128070 
bq.    trunk/hbase-handler/src/test/queries/hbase_stats2.q PRE-CREATION 
bq.    trunk/hbase-handler/src/test/results/hbase_stats.q.out 1128070 
bq.    trunk/hbase-handler/src/test/results/hbase_stats2.q.out PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1128070 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1128070 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1128070 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1128070 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1128070 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1128070 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1128070 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1128070 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1128070 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1128070 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1128070 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1128070 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1128070 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1128070 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1128070 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1128070 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1128070 
bq.    trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1128070 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1128070 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1128070 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1128070 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1128070 
bq.    trunk/ql/src/test/results/clientpositive/combine2.q.out 1128070 
bq.    trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out 1128070 
bq.    trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out 1128070 
bq.    trunk/ql/src/test/results/clientpositive/merge3.q.out 1128070 
bq.    trunk/ql/src/test/results/clientpositive/merge4.q.out 1128070 
bq.    trunk/ql/src/test/results/clientpositive/pcr.q.out 1128070 
bq.    trunk/ql/src/test/results/clientpositive/sample10.q.out 1128070 
bq.    trunk/ql/src/test/results/clientpositive/stats11.q.out 1128070 
bq.    trunk/ql/src/test/results/clientpositive/stats14.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/stats15.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/union22.q.out 1128070 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/Deserializer.java 1128070 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/MetadataTypedColumnsetSerDe.java 1128070 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java PRE-CREATION 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStatsStruct.java PRE-CREATION 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/Serializer.java 1128070 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/TypedSerDe.java 1128070 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java 1128070 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java 1128070 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java 1128070 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/dynamic_type/DynamicSerDe.java 1128070 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java 1128070 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyStruct.java 1128070 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java 1128070 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryStruct.java 1128070 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/thrift/ThriftDeserializer.java 1128070 
bq.    trunk/serde/src/test/org/apache/hadoop/hive/serde2/TestStatsSerde.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/785/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  - additional JUnit test for Serializer/Deserializer amended classes
bq.  - additional queries for TestCliDriver over multi-partition tables
bq.  - all other JUnit tests
bq.  - standalone setup 
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Tomasz
bq.  
bq.



> extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2185
>                 URL: https://issues.apache.org/jira/browse/HIVE-2185
>             Project: Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers, Statistics
>            Reporter: Tomasz Nykiel
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2185.1.patch, HIVE-2185.patch
>
>
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. 
> Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
> Currently, a large part of statistics collection mechanism is hardcoded and not-easily extensible for other statistics.
> On top of adding the new statistic collected, it would be desirable to extend the collection mechanism, so any new statistics could be added easily.


[jira] [Commented] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039943#comment-13039943 ] 

jiraposter@reviews.apache.org commented on HIVE-2185:
-----------------------------------------------------



bq.  On 2011-05-26 21:12:30, Ning Zhang wrote:
bq.  > trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java, line 100
bq.  > <https://reviews.apache.org/r/785/diff/1/?file=19586#file19586line100>
bq.  >
bq.  >     should be >= here

Yes.


bq.  On 2011-05-26 21:12:30, Ning Zhang wrote:
bq.  > trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java, line 82
bq.  > <https://reviews.apache.org/r/785/diff/1/?file=19585#file19585line82>
bq.  >
bq.  >     Shouldn't isValidStatics() take "key" as a parameter rather than "rowID"? "key" should indicate which statistic this is, right?

Yes. It was a bug; I already fixed it once I ran the HBase JUnit tests :)


- Tomasz


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/#review719
-----------------------------------------------------------


On 2011-05-26 02:52:55, Tomasz Nykiel wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/785/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-05-26 02:52:55)
bq.  
bq.  
bq.  Review request for hive.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. 
bq.  Other statistics (e.g., total table/partition size) are derived from the file system.
bq.  
bq.  We introduce a new feature for collecting information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
bq.  On top of adding the new statistic collected, this patch extends the stats collection mechanism, so any new statistics could be added easily.
bq.  
bq.  1. serializer/deserializer classes are amended to accommodate collecting sizes of uncompressed data, when serializing/deserializing objects.
bq.  We support:
bq.  
bq.  Columnar SerDe
bq.  LazySimpleSerDe
bq.  LazyBinarySerDe
bq.  
bq.  For other SerDe classes the uncompressed size will be 0.
bq.  
bq.  2. StatsPublisher / StatsAggregator interfaces are extended to support multi-stats collection for both JDBC and HBase.
bq.  
bq.  3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and TableScanOperator respectively are extended to support multi-stats collection.
bq.  
bq.  (2) and (3) enable easy extension for other types of statistics.
bq.  
bq.  4. Collecting uncompressed size can be disabled by setting:
bq.  
bq.  hive.stats.collect.uncompressedsize = false
bq.  
bq.  
bq.  This addresses bug HIVE-2185.
bq.      https://issues.apache.org/jira/browse/HIVE-2185
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1127756 
bq.    trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1127756 
bq.    trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1127756 
bq.    trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1127756 
bq.    trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1127756 
bq.    trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1127756 
bq.    trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1127756 
bq.    trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1127756 
bq.    trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION 
bq.    trunk/hbase-handler/src/test/queries/hbase_stats.q 1127756 
bq.    trunk/hbase-handler/src/test/results/hbase_stats.q.out 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1127756 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1127756 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1127756 
bq.    trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/combine2.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/merge3.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/merge4.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/pcr.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/sample10.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/stats11.q.out 1127756 
bq.    trunk/ql/src/test/results/clientpositive/stats14.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/stats15.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/union22.q.out 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/Deserializer.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/MetadataTypedColumnsetSerDe.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java PRE-CREATION 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStatsStruct.java PRE-CREATION 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/Serializer.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/TypedSerDe.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/dynamic_type/DynamicSerDe.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyStruct.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryStruct.java 1127756 
bq.    trunk/serde/src/java/org/apache/hadoop/hive/serde2/thrift/ThriftDeserializer.java 1127756 
bq.    trunk/serde/src/test/org/apache/hadoop/hive/serde2/TestStatsSerde.java PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/785/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  - additional JUnit test for Serializer/Deserializer amended classes
bq.  - additional queries for TestCliDriver over multi-partition tables
bq.  - all other JUnit tests
bq.  - standalone setup 
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Tomasz
bq.  
bq.



> extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2185
>                 URL: https://issues.apache.org/jira/browse/HIVE-2185
>             Project: Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers, Statistics
>            Reporter: Tomasz Nykiel
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2185.patch
>
>
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. 
> Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
> Currently, a large part of statistics collection mechanism is hardcoded and not-easily extensible for other statistics.
> On top of adding the new statistic collected, it would be desirable to extend the collection mechanism, so any new statistics could be added easily.


[jira] [Commented] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043034#comment-13043034 ] 

jiraposter@reviews.apache.org commented on HIVE-2185:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/
-----------------------------------------------------------

(Updated 2011-06-02 20:36:48.205733)


Review request for hive.


Changes
-------

-Fixed issues pointed out in the review.
-Changed the metric name from uncompressedSize to rawDataSize


Summary
-------

Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. 
Other statistics (e.g., total table/partition size) are derived from the file system.

We introduce a new feature for collecting information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
On top of adding the new statistic collected, this patch extends the stats collection mechanism, so any new statistics could be added easily.

1. serializer/deserializer classes are amended to accommodate collecting sizes of uncompressed data, when serializing/deserializing objects.
We support:

Columnar SerDe
LazySimpleSerDe
LazyBinarySerDe

For other SerDe classes the uncompressed size will be 0.

2. StatsPublisher / StatsAggregator interfaces are extended to support multi-stats collection for both JDBC and HBase.

3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and TableScanOperator respectively are extended to support multi-stats collection.

(2) and (3) enable easy extension for other types of statistics.

4. Collecting uncompressed size can be disabled by setting:

hive.stats.collect.uncompressedsize = false


This addresses bug HIVE-2185.
    https://issues.apache.org/jira/browse/HIVE-2185


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1130791 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1130791 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1130791 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1130791 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1130791 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1130791 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1130791 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1130791 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION 
  trunk/hbase-handler/src/test/queries/hbase_stats.q 1130791 
  trunk/hbase-handler/src/test/queries/hbase_stats2.q PRE-CREATION 
  trunk/hbase-handler/src/test/results/hbase_stats.q.out 1130791 
  trunk/hbase-handler/src/test/results/hbase_stats2.q.out PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1130791 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1130791 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1130791 
  trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/combine2.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/merge3.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/merge4.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/pcr.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/sample10.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/stats11.q.out 1130791 
  trunk/ql/src/test/results/clientpositive/stats14.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/stats15.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/union22.q.out 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/Deserializer.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/MetadataTypedColumnsetSerDe.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java PRE-CREATION 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStatsStruct.java PRE-CREATION 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/Serializer.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/TypedSerDe.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/dynamic_type/DynamicSerDe.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyStruct.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryStruct.java 1130791 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/thrift/ThriftDeserializer.java 1130791 
  trunk/serde/src/test/org/apache/hadoop/hive/serde2/TestStatsSerde.java PRE-CREATION 

Diff: https://reviews.apache.org/r/785/diff


Testing
-------

- additional JUnit test for Serializer/Deserializer amended classes
- additional queries for TestCliDriver over multi-partition tables
- all other JUnit tests
- standalone setup 


Thanks,

Tomasz



> extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2185
>                 URL: https://issues.apache.org/jira/browse/HIVE-2185
>             Project: Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers, Statistics
>            Reporter: Tomasz Nykiel
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2185.1.patch, HIVE-2185.2.patch, HIVE-2185.patch
>
>
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. 
> Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
> Currently, a large part of statistics collection mechanism is hardcoded and not-easily extensible for other statistics.
> On top of adding the new statistic collected, it would be desirable to extend the collection mechanism, so any new statistics could be added easily.


[jira] [Commented] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039946#comment-13039946 ] 

jiraposter@reviews.apache.org commented on HIVE-2185:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/
-----------------------------------------------------------

(Updated 2011-05-26 21:27:34.475653)


Review request for hive.


Changes
-------

-Fixed HBase stats publishing


Summary
-------

Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. 
Other statistics (e.g., total table/partition size) are derived from the file system.

We introduce a new feature for collecting information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
On top of adding the new statistic collected, this patch extends the stats collection mechanism, so any new statistics could be added easily.

1. Serializer/deserializer classes are amended to collect the size of uncompressed data while serializing/deserializing objects.
We support:

ColumnarSerDe
LazySimpleSerDe
LazyBinarySerDe

For other SerDe classes the uncompressed size will be 0.

2. StatsPublisher / StatsAggregator interfaces are extended to support multi-stats collection for both JDBC and HBase.

3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and TableScanOperator respectively are extended to support multi-stats collection.

(2) and (3) enable easy extension for other types of statistics.

4. Collecting uncompressed size can be disabled by setting:

hive.stats.collect.uncompressedsize = false
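
As a rough illustration of item 1 and of what the new statistic is for, the sketch below adds up the serialized size of each row and compares the total with an on-disk size. It is not the code from the patch: RawSizeStats, addRow and compressionRatio are hypothetical names chosen for this example, and the sample rows and the on-disk byte count are made up.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a per-SerDe statistics holder; NOT the Hive class.
final class RawSizeStats {
    private long rawDataSize; // bytes seen before any compression is applied
    private long rowCount;

    // Called once per serialized row with that row's uncompressed bytes.
    void addRow(byte[] serializedRow) {
        rawDataSize += serializedRow.length;
        rowCount++;
    }

    long getRawDataSize() { return rawDataSize; }

    long getRowCount() { return rowCount; }

    // Uncompressed bytes divided by bytes on disk; larger means better compression.
    // Returns 0.0 when the on-disk size is zero or unknown.
    double compressionRatio(long onDiskBytes) {
        return onDiskBytes <= 0 ? 0.0 : (double) rawDataSize / onDiskBytes;
    }
}

final class RawSizeStatsDemo {
    public static void main(String[] args) {
        RawSizeStats stats = new RawSizeStats();
        // Two made-up tab-separated rows.
        stats.addRow("238\tval_238".getBytes(StandardCharsets.UTF_8));
        stats.addRow("86\tval_86".getBytes(StandardCharsets.UTF_8));
        System.out.println("rows=" + stats.getRowCount()
                + " rawDataSize=" + stats.getRawDataSize());
        // 9 is an assumed on-disk size, used only to show the ratio.
        System.out.println("ratio=" + stats.compressionRatio(9));
    }
}
```

A SerDe that does not track sizes would simply never call addRow, which matches the behaviour described above: its reported uncompressed size stays 0.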


This addresses bug HIVE-2185.
    https://issues.apache.org/jira/browse/HIVE-2185


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1128070 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1128070 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1128070 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1128070 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1128070 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1128070 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1128070 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1128070 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION 
  trunk/hbase-handler/src/test/queries/hbase_stats.q 1128070 
  trunk/hbase-handler/src/test/queries/hbase_stats2.q PRE-CREATION 
  trunk/hbase-handler/src/test/results/hbase_stats.q.out 1128070 
  trunk/hbase-handler/src/test/results/hbase_stats2.q.out PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1128070 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1128070 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1128070 
  trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/combine2.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/merge3.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/merge4.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/pcr.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/sample10.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/stats11.q.out 1128070 
  trunk/ql/src/test/results/clientpositive/stats14.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/stats15.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/union22.q.out 1128070 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/Deserializer.java 1128070 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/MetadataTypedColumnsetSerDe.java 1128070 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java PRE-CREATION 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStatsStruct.java PRE-CREATION 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/Serializer.java 1128070 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/TypedSerDe.java 1128070 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java 1128070 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java 1128070 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java 1128070 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/dynamic_type/DynamicSerDe.java 1128070 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java 1128070 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyStruct.java 1128070 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java 1128070 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryStruct.java 1128070 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/thrift/ThriftDeserializer.java 1128070 
  trunk/serde/src/test/org/apache/hadoop/hive/serde2/TestStatsSerde.java PRE-CREATION 

Diff: https://reviews.apache.org/r/785/diff


Testing
-------

- additional JUnit test for Serializer/Deserializer amended classes
- additional queries for TestCliDriver over multi-partition tables
- all other JUnit tests
- standalone setup 


Thanks,

Tomasz



> extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2185
>                 URL: https://issues.apache.org/jira/browse/HIVE-2185
>             Project: Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers, Statistics
>            Reporter: Tomasz Nykiel
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2185.1.patch, HIVE-2185.patch
>
>
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. 
> Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
> Currently, a large part of statistics collection mechanism is hardcoded and not-easily extensible for other statistics.
> On top of adding the new statistic collected, it would be desirable to extend the collection mechanism, so any new statistics could be added easily.


[jira] [Commented] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039493#comment-13039493 ] 

jiraposter@reviews.apache.org commented on HIVE-2185:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/
-----------------------------------------------------------

Review request for hive.


Summary
-------

Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. 
Other statistics (e.g., total table/partition size) are derived from the file system.

We introduce a new feature for collecting information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
On top of adding the new statistic collected, this patch extends the stats collection mechanism, so any new statistics could be added easily.

1. Serializer/deserializer classes are amended to collect the size of uncompressed data while serializing/deserializing objects.
We support:

ColumnarSerDe
LazySimpleSerDe
LazyBinarySerDe

For other SerDe classes the uncompressed size will be 0.

2. StatsPublisher / StatsAggregator interfaces are extended to support multi-stats collection for both JDBC and HBase.

3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and TableScanOperator respectively are extended to support multi-stats collection.

(2) and (3) enable easy extension for other types of statistics.

4. Collecting uncompressed size can be disabled by setting:

hive.stats.collect.uncompressedsize = false
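
Items 2 and 3 above describe publishing and aggregating several statistics at once instead of a single row count. One way to picture that is a store keyed by task, holding a map of statistic names to values. The sketch below is an in-memory toy under that assumption: InMemoryStatsStore, publish and aggregate are hypothetical names, the key layout is invented, and nothing here reflects the actual JDBC or HBase implementations in the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory store standing in for a JDBC or HBase backend.
final class InMemoryStatsStore {
    // task key (for example partition path + task id) -> (stat name -> value)
    private final Map<String, Map<String, Long>> rows = new HashMap<>();

    // Publish every statistic a task collected in one call.
    void publish(String taskKey, Map<String, Long> stats) {
        rows.put(taskKey, new HashMap<>(stats));
    }

    // Sum one named statistic over all tasks whose key starts with the prefix.
    // A task that never published the statistic contributes 0.
    long aggregate(String keyPrefix, String statName) {
        long total = 0;
        for (Map.Entry<String, Map<String, Long>> e : rows.entrySet()) {
            if (e.getKey().startsWith(keyPrefix)) {
                total += e.getValue().getOrDefault(statName, 0L);
            }
        }
        return total;
    }
}
```

With this shape, adding another statistic means adding another map entry on the publishing side and another aggregate call on the reading side; neither method signature has to change.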


This addresses bug HIVE-2185.
    https://issues.apache.org/jira/browse/HIVE-2185


Diffs
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1127756 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1127756 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1127756 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION 
  trunk/hbase-handler/src/test/queries/hbase_stats.q 1127756 
  trunk/hbase-handler/src/test/results/hbase_stats.q.out 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1127756 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1127756 
  trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/combine2.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/merge3.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/merge4.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/pcr.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/sample10.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/stats11.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/stats14.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/stats15.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/union22.q.out 1127756 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/Deserializer.java 1127756 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/MetadataTypedColumnsetSerDe.java 1127756 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java PRE-CREATION 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStatsStruct.java PRE-CREATION 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/Serializer.java 1127756 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/TypedSerDe.java 1127756 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java 1127756 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java 1127756 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java 1127756 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/dynamic_type/DynamicSerDe.java 1127756 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java 1127756 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyStruct.java 1127756 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java 1127756 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryStruct.java 1127756 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/thrift/ThriftDeserializer.java 1127756 
  trunk/serde/src/test/org/apache/hadoop/hive/serde2/TestStatsSerde.java PRE-CREATION 

Diff: https://reviews.apache.org/r/785/diff


Testing
-------

- additional JUnit test for Serializer/Deserializer amended classes
- additional queries for TestCliDriver over multi-partition tables
- all other JUnit tests
- standalone setup 


Thanks,

Tomasz



> extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2185
>                 URL: https://issues.apache.org/jira/browse/HIVE-2185
>             Project: Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers, Statistics
>            Reporter: Tomasz Nykiel
>            Assignee: Tomasz Nykiel
>
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. 
> Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
> Currently, a large part of statistics collection mechanism is hardcoded and not-easily extensible for other statistics.
> On top of adding the new statistic collected, it would be desirable to extend the collection mechanism, so any new statistics could be added easily.


[jira] [Updated] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

Posted by "Tomasz Nykiel (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomasz Nykiel updated HIVE-2185:
--------------------------------

    Attachment: HIVE-2185.patch

> extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2185
>                 URL: https://issues.apache.org/jira/browse/HIVE-2185
>             Project: Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers, Statistics
>            Reporter: Tomasz Nykiel
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2185.patch
>
>
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. 
> Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
> Currently, a large part of statistics collection mechanism is hardcoded and not-easily extensible for other statistics.
> On top of adding the new statistic collected, it would be desirable to extend the collection mechanism, so any new statistics could be added easily.


[jira] [Updated] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

Posted by "Tomasz Nykiel (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomasz Nykiel updated HIVE-2185:
--------------------------------

    Attachment: HIVE-2185.1.patch

> extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2185
>                 URL: https://issues.apache.org/jira/browse/HIVE-2185
>             Project: Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers, Statistics
>            Reporter: Tomasz Nykiel
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2185.1.patch, HIVE-2185.patch
>
>
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. 
> Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
> Currently, a large part of statistics collection mechanism is hardcoded and not-easily extensible for other statistics.
> On top of adding the new statistic collected, it would be desirable to extend the collection mechanism, so any new statistics could be added easily.


[jira] [Updated] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

Posted by "Carl Steinbach (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-2185:
---------------------------------

    Release Note: This patch added getSerDeStats() methods to the Serializer and Deserializer interfaces. Consequently, any SerDes which were compiled against the old interfaces will need to be recompiled against the new interfaces in order to work against Hive 0.8.0.
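
To make the release note concrete: a custom SerDe that does not track sizes still has to provide the new method once the interface declares it. The sketch below uses toy stand-ins (ToyStats, ToySerializer and NoStatsSerializer are hypothetical and are not the Hive types or their exact signatures) to show a serializer that satisfies such an interface by returning a statistics object whose size stays 0.

```java
import java.nio.charset.StandardCharsets;

// Toy statistics holder; the real class is not reproduced here.
final class ToyStats {
    private long rawDataSize;

    long getRawDataSize() { return rawDataSize; }

    void setRawDataSize(long bytes) { rawDataSize = bytes; }
}

// Toy interface with a stats accessor of the kind the release note describes.
interface ToySerializer {
    byte[] serialize(String row);

    ToyStats getSerDeStats();
}

// A serializer that does no size tracking: it implements the accessor
// but never updates the statistics, so the reported size remains 0.
final class NoStatsSerializer implements ToySerializer {
    private final ToyStats stats = new ToyStats();

    @Override
    public byte[] serialize(String row) {
        return row.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public ToyStats getSerDeStats() {
        return stats;
    }
}
```

A class that implements only the old interface declares no such method at all, which is why the release note asks for SerDes to be updated and recompiled.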
    
> extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2185
>                 URL: https://issues.apache.org/jira/browse/HIVE-2185
>             Project: Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers, Statistics
>            Reporter: Tomasz Nykiel
>            Assignee: Tomasz Nykiel
>             Fix For: 0.8.0
>
>         Attachments: HIVE-2185.1.patch, HIVE-2185.2.patch, HIVE-2185.patch
>
>
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. 
> Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
> Currently, a large part of statistics collection mechanism is hardcoded and not-easily extensible for other statistics.
> On top of adding the new statistic collected, it would be desirable to extend the collection mechanism, so any new statistics could be added easily.


[jira] [Commented] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

Posted by "Tomasz Nykiel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043237#comment-13043237 ] 

Tomasz Nykiel commented on HIVE-2185:
-------------------------------------

I ran all tests. All values are the same as before; only the name of the metric has changed.
Thanks.

> extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2185
>                 URL: https://issues.apache.org/jira/browse/HIVE-2185
>             Project: Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers, Statistics
>            Reporter: Tomasz Nykiel
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2185.1.patch, HIVE-2185.2.patch, HIVE-2185.patch
>
>
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. 
> Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
> Currently, a large part of statistics collection mechanism is hardcoded and not-easily extensible for other statistics.
> On top of adding the new statistic collected, it would be desirable to extend the collection mechanism, so any new statistics could be added easily.


[jira] [Resolved] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang resolved HIVE-2185.
------------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0
     Hadoop Flags: [Reviewed]

Committed. Thanks Tom!

> extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2185
>                 URL: https://issues.apache.org/jira/browse/HIVE-2185
>             Project: Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers, Statistics
>            Reporter: Tomasz Nykiel
>            Assignee: Tomasz Nykiel
>             Fix For: 0.8.0
>
>         Attachments: HIVE-2185.1.patch, HIVE-2185.2.patch, HIVE-2185.patch
>
>
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. 
> Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
> Currently, a large part of statistics collection mechanism is hardcoded and not-easily extensible for other statistics.
> On top of adding the new statistic collected, it would be desirable to extend the collection mechanism, so any new statistics could be added easily.


[jira] [Updated] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

Posted by "Tomasz Nykiel (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomasz Nykiel updated HIVE-2185:
--------------------------------

    Attachment: HIVE-2185.2.patch

Fixed some minor issues.
Renamed the metric to rawDataSize.

> extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2185
>                 URL: https://issues.apache.org/jira/browse/HIVE-2185
>             Project: Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers, Statistics
>            Reporter: Tomasz Nykiel
>            Assignee: Tomasz Nykiel
>         Attachments: HIVE-2185.1.patch, HIVE-2185.2.patch, HIVE-2185.patch
>
>
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. 
> Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.
> Currently, a large part of statistics collection mechanism is hardcoded and not-easily extensible for other statistics.
> On top of adding the new statistic collected, it would be desirable to extend the collection mechanism, so any new statistics could be added easily.
