Posted to dev@hive.apache.org by Brock Noland <br...@cloudera.com> on 2013/06/09 22:11:14 UTC

Review Request: HIVE-4113: Optimize select count(1) with RCFile and Orc

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11770/
-----------------------------------------------------------

Review request for hive.


Description
-------

Modifies ColumnProjectionUtils such that there are two flags: one for the column ids and one indicating whether all columns should be read. Additionally, the patch updates all locations which use the old method of an empty string indicating that all columns should be read.
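
To make the two-flag idea concrete, here is a minimal sketch of how it could look. It is not the code from the patch: the class name, the method names, and the property keys are hypothetical, and java.util.Properties stands in for the Hadoop job configuration so the example is self-contained. Defaulting "read all columns" to true when nothing is set is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch of the two-flag scheme; not the actual ColumnProjectionUtils code.
final class ProjectionFlagsSketch {
  // Assumed key names, chosen for illustration only.
  static final String READ_COLUMN_IDS = "sketch.read.column.ids";
  static final String READ_ALL_COLUMNS = "sketch.read.all.columns";

  private ProjectionFlagsSketch() {}

  // Explicitly request every column.
  static void setReadAllColumns(Properties conf) {
    conf.setProperty(READ_ALL_COLUMNS, "true");
    conf.setProperty(READ_COLUMN_IDS, "");
  }

  // Request only the given columns. An empty list now means "no columns",
  // which is what a query such as select count(1) needs.
  static void setReadColumnIds(Properties conf, List<Integer> ids) {
    StringBuilder sb = new StringBuilder();
    for (Integer id : ids) {
      if (sb.length() > 0) {
        sb.append(',');
      }
      sb.append(id);
    }
    conf.setProperty(READ_ALL_COLUMNS, "false");
    conf.setProperty(READ_COLUMN_IDS, sb.toString());
  }

  // Assumption: an unconfigured reader falls back to reading everything.
  static boolean isReadAllColumns(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(READ_ALL_COLUMNS, "true"));
  }

  // Parse the comma-separated id list; an empty string yields an empty list.
  static List<Integer> getReadColumnIds(Properties conf) {
    List<Integer> result = new ArrayList<>();
    String raw = conf.getProperty(READ_COLUMN_IDS, "");
    for (String part : raw.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        result.add(Integer.parseInt(trimmed));
      }
    }
    return result;
  }
}
```

The point of the separate boolean is that an empty id list no longer has to double as "read everything", so a reader can tell "no columns needed" apart from "all columns needed".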

The automatic formatter generated by ant eclipse-files is fairly aggressive, so there is some unrelated import/whitespace cleanup.


This addresses bug HIVE-4113.
    https://issues.apache.org/jira/browse/HIVE-4113


Diffs
-----

  hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java da85501 
  hcatalog/core/src/main/java/org/apache/hcatalog/mapreduce/HCatBaseInputFormat.java bc0e04c 
  hcatalog/core/src/main/java/org/apache/hcatalog/mapreduce/HCatRecordReader.java ac3753f 
  hcatalog/core/src/main/java/org/apache/hcatalog/mapreduce/InitializeInput.java 02ec37f 
  hcatalog/core/src/main/java/org/apache/hcatalog/mapreduce/InternalUtil.java 4167afa 
  hcatalog/core/src/test/java/org/apache/hcatalog/mapreduce/TestHCatMultiOutputFormat.java b5f22af 
  hcatalog/core/src/test/java/org/apache/hcatalog/mapreduce/TestHCatPartitioned.java dd2ac10 
  hcatalog/hcatalog-pig-adapter/src/test/java/org/apache/hcatalog/pig/TestHCatLoader.java e907c73 
  ql/src/java/org/apache/hadoop/hive/ql/exec/MapredLocalTask.java 6bbcb26 
  ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 1a784b2 
  ql/src/java/org/apache/hadoop/hive/ql/io/BucketizedHiveInputFormat.java 49145b7 
  ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java adf4923 
  ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java d18d403 
  ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 9521060 
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java 96ac584 
  ql/src/java/org/apache/hadoop/hive/ql/io/rcfile/merge/RCFileBlockMergeRecordReader.java cbdc2db 
  ql/src/test/org/apache/hadoop/hive/ql/QTestUtil.java 9fc52fa 
  ql/src/test/org/apache/hadoop/hive/ql/io/PerformTestRCFileAndSeqFile.java 0df08e4 
  ql/src/test/org/apache/hadoop/hive/ql/io/TestRCFile.java e33a1ce 
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestInputOutputFormat.java 785f0b1 
  serde/src/java/org/apache/hadoop/hive/serde2/ColumnProjectionUtils.java 23180cf 
  serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java 11f5f07 
  serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java 1335446 
  serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStructBase.java e1270cc 
  serde/src/java/org/apache/hadoop/hive/serde2/columnar/LazyBinaryColumnarSerDe.java b717278 
  serde/src/java/org/apache/hadoop/hive/serde2/columnar/LazyBinaryColumnarStruct.java 0317024 
  serde/src/test/org/apache/hadoop/hive/serde2/TestColumnProjectionUtils.java PRE-CREATION 
  serde/src/test/org/apache/hadoop/hive/serde2/TestStatsSerde.java 3ba2699 
  serde/src/test/org/apache/hadoop/hive/serde2/columnar/TestLazyBinaryColumnarSerDe.java 99420ca 

Diff: https://reviews.apache.org/r/11770/diff/


Testing
-------

All unit tests pass with the patch. ColumnProjectionUtils has new unit tests covering its functionality. Additionally, I verified manually that select count(1) on RCFile/Orc tables resulted in less IO after the change.

Before:

hive> select count(1) from users_orc;
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 17.75 sec   HDFS Read: 28782851 HDFS Write: 9 SUCCESS

hive> select count(1) from users_rc; 
Job 0: Map: 3  Reduce: 1   Cumulative CPU: 23.72 sec   HDFS Read: 825865962 HDFS Write: 9 SUCCESS

After:


hive> select count(1) from users_orc;
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 9.9 sec   HDFS Read: 67325 HDFS Write: 9 SUCCESS

hive> select count(1) from users_rc; 
Job 0: Map: 3  Reduce: 1   Cumulative CPU: 16.96 sec   HDFS Read: 96045618 HDFS Write: 9 SUCCESS
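
For a rough sense of scale, the HDFS Read figures above can be turned into a reduction factor and a percentage saved. The small sketch below does only that arithmetic on the four byte counts from the job output; the class and method names are made up for illustration.

```java
// Hypothetical helper: arithmetic on the HDFS Read byte counts logged above.
final class HdfsReadReductionSketch {
  private HdfsReadReductionSketch() {}

  // How many times fewer bytes were read after the change.
  static double reductionFactor(long beforeBytes, long afterBytes) {
    return (double) beforeBytes / afterBytes;
  }

  // Percentage of the original read volume that was avoided.
  static double percentSaved(long beforeBytes, long afterBytes) {
    return 100.0 * (beforeBytes - afterBytes) / beforeBytes;
  }

  public static void main(String[] args) {
    String[] tables = {"users_orc", "users_rc"};
    // {before, after} HDFS Read bytes, copied from the job output.
    long[][] reads = {
      {28782851L, 67325L},
      {825865962L, 96045618L},
    };
    for (int i = 0; i < tables.length; i++) {
      System.out.printf("%s: %.1fx fewer bytes read (%.2f%% saved)%n",
          tables[i],
          reductionFactor(reads[i][0], reads[i][1]),
          percentSaved(reads[i][0], reads[i][1]));
    }
  }
}
```

By this arithmetic the Orc table reads a few hundred times fewer bytes, while the RCFile table reads a bit under nine times fewer.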


Thanks,

Brock Noland


Re: Review Request 11770: HIVE-4113: Optimize select count(1) with RCFile and Orc

Posted by Brock Noland <br...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11770/
-----------------------------------------------------------

(Updated July 15, 2013, 7:51 p.m.)


Review request for hive.


Changes
-------

A test was missed in the previous upload; it is included now.


Bugs: HIVE-4113
    https://issues.apache.org/jira/browse/HIVE-4113


Repository: hive-git


Description
-------

Modifies ColumnProjectionUtils such that there are two flags: one for the column ids and one indicating whether all columns should be read. Additionally, the patch updates all locations which use the old method of an empty string indicating that all columns should be read.

The automatic formatter generated by ant eclipse-files is fairly aggressive, so there is some unrelated import/whitespace cleanup.


Diffs (updated)
-----

  hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java da85501 
  hcatalog/core/src/main/java/org/apache/hcatalog/mapreduce/HCatBaseInputFormat.java bc0e04c 
  hcatalog/core/src/main/java/org/apache/hcatalog/mapreduce/HCatRecordReader.java ac3753f 
  hcatalog/core/src/main/java/org/apache/hcatalog/mapreduce/InitializeInput.java 02ec37f 
  hcatalog/core/src/main/java/org/apache/hcatalog/mapreduce/InternalUtil.java 4167afa 
  hcatalog/core/src/test/java/org/apache/hcatalog/mapreduce/TestHCatMultiOutputFormat.java b5f22af 
  hcatalog/core/src/test/java/org/apache/hcatalog/mapreduce/TestHCatPartitioned.java dd2ac10 
  hcatalog/hcatalog-pig-adapter/src/test/java/org/apache/hcatalog/pig/TestHCatLoader.java e907c73 
  ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 1a784b2 
  ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapredLocalTask.java f72ecfb 
  ql/src/java/org/apache/hadoop/hive/ql/io/BucketizedHiveInputFormat.java 49145b7 
  ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java adf4923 
  ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java d18d403 
  ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 9521060 
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java 96ac584 
  ql/src/java/org/apache/hadoop/hive/ql/io/rcfile/merge/RCFileBlockMergeRecordReader.java cbdc2db 
  ql/src/test/org/apache/hadoop/hive/ql/QTestUtil.java 400abf3 
  ql/src/test/org/apache/hadoop/hive/ql/io/PerformTestRCFileAndSeqFile.java fb9fca1 
  ql/src/test/org/apache/hadoop/hive/ql/io/TestRCFile.java ae6a5ee 
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestInputOutputFormat.java 785f0b1 
  serde/src/java/org/apache/hadoop/hive/serde2/ColumnProjectionUtils.java 23180cf 
  serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java 11f5f07 
  serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java 1335446 
  serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStructBase.java e1270cc 
  serde/src/java/org/apache/hadoop/hive/serde2/columnar/LazyBinaryColumnarSerDe.java b717278 
  serde/src/java/org/apache/hadoop/hive/serde2/columnar/LazyBinaryColumnarStruct.java 0317024 
  serde/src/test/org/apache/hadoop/hive/serde2/TestColumnProjectionUtils.java PRE-CREATION 
  serde/src/test/org/apache/hadoop/hive/serde2/TestStatsSerde.java 3ba2699 
  serde/src/test/org/apache/hadoop/hive/serde2/columnar/TestLazyBinaryColumnarSerDe.java 99420ca 

Diff: https://reviews.apache.org/r/11770/diff/


Testing
-------

All unit tests pass with the patch. ColumnProjectionUtils has new unit tests covering its functionality. Additionally, I verified manually that select count(1) on RCFile/Orc tables resulted in less IO after the change.

Before:

hive> select count(1) from users_orc;
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 17.75 sec   HDFS Read: 28782851 HDFS Write: 9 SUCCESS

hive> select count(1) from users_rc; 
Job 0: Map: 3  Reduce: 1   Cumulative CPU: 23.72 sec   HDFS Read: 825865962 HDFS Write: 9 SUCCESS

After:


hive> select count(1) from users_orc;
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 9.9 sec   HDFS Read: 67325 HDFS Write: 9 SUCCESS

hive> select count(1) from users_rc; 
Job 0: Map: 3  Reduce: 1   Cumulative CPU: 16.96 sec   HDFS Read: 96045618 HDFS Write: 9 SUCCESS


Thanks,

Brock Noland


Re: Review Request 11770: HIVE-4113: Optimize select count(1) with RCFile and Orc

Posted by Brock Noland <br...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11770/
-----------------------------------------------------------

(Updated July 15, 2013, 7:47 p.m.)


Review request for hive.


Changes
-------

Rebased the patch; no real changes.


Bugs: HIVE-4113
    https://issues.apache.org/jira/browse/HIVE-4113


Repository: hive-git


Description
-------

Modifies ColumnProjectionUtils such that there are two flags: one for the column ids and one indicating whether all columns should be read. Additionally, the patch updates all locations which use the old method of an empty string indicating that all columns should be read.

The automatic formatter generated by ant eclipse-files is fairly aggressive, so there is some unrelated import/whitespace cleanup.


Diffs (updated)
-----

  hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java da85501 
  hcatalog/core/src/main/java/org/apache/hcatalog/mapreduce/HCatBaseInputFormat.java bc0e04c 
  hcatalog/core/src/main/java/org/apache/hcatalog/mapreduce/HCatRecordReader.java ac3753f 
  hcatalog/core/src/main/java/org/apache/hcatalog/mapreduce/InitializeInput.java 02ec37f 
  hcatalog/core/src/main/java/org/apache/hcatalog/mapreduce/InternalUtil.java 4167afa 
  hcatalog/core/src/test/java/org/apache/hcatalog/mapreduce/TestHCatMultiOutputFormat.java b5f22af 
  hcatalog/core/src/test/java/org/apache/hcatalog/mapreduce/TestHCatPartitioned.java dd2ac10 
  hcatalog/hcatalog-pig-adapter/src/test/java/org/apache/hcatalog/pig/TestHCatLoader.java e907c73 
  ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 1a784b2 
  ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapredLocalTask.java f72ecfb 
  ql/src/java/org/apache/hadoop/hive/ql/io/BucketizedHiveInputFormat.java 49145b7 
  ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java adf4923 
  ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java d18d403 
  ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 9521060 
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java 96ac584 
  ql/src/java/org/apache/hadoop/hive/ql/io/rcfile/merge/RCFileBlockMergeRecordReader.java cbdc2db 
  ql/src/test/org/apache/hadoop/hive/ql/QTestUtil.java 400abf3 
  ql/src/test/org/apache/hadoop/hive/ql/io/PerformTestRCFileAndSeqFile.java fb9fca1 
  ql/src/test/org/apache/hadoop/hive/ql/io/TestRCFile.java ae6a5ee 
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestInputOutputFormat.java 785f0b1 
  serde/src/java/org/apache/hadoop/hive/serde2/ColumnProjectionUtils.java 23180cf 
  serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java 11f5f07 
  serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java 1335446 
  serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStructBase.java e1270cc 
  serde/src/java/org/apache/hadoop/hive/serde2/columnar/LazyBinaryColumnarSerDe.java b717278 
  serde/src/java/org/apache/hadoop/hive/serde2/columnar/LazyBinaryColumnarStruct.java 0317024 
  serde/src/test/org/apache/hadoop/hive/serde2/TestStatsSerde.java 3ba2699 
  serde/src/test/org/apache/hadoop/hive/serde2/columnar/TestLazyBinaryColumnarSerDe.java 99420ca 

Diff: https://reviews.apache.org/r/11770/diff/


Testing
-------

All unit tests pass with the patch. ColumnProjectionUtils has new unit tests covering its functionality. Additionally, I verified manually that select count(1) on RCFile/Orc tables resulted in less IO after the change.

Before:

hive> select count(1) from users_orc;
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 17.75 sec   HDFS Read: 28782851 HDFS Write: 9 SUCCESS

hive> select count(1) from users_rc; 
Job 0: Map: 3  Reduce: 1   Cumulative CPU: 23.72 sec   HDFS Read: 825865962 HDFS Write: 9 SUCCESS

After:


hive> select count(1) from users_orc;
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 9.9 sec   HDFS Read: 67325 HDFS Write: 9 SUCCESS

hive> select count(1) from users_rc; 
Job 0: Map: 3  Reduce: 1   Cumulative CPU: 16.96 sec   HDFS Read: 96045618 HDFS Write: 9 SUCCESS


Thanks,

Brock Noland