You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by "Kevin Wilfong (Created) (JIRA)" <ji...@apache.org> on 2011/10/28 23:58:35 UTC

[jira] [Created] (HIVE-2535) Use sorted nature of compact indexes

Use sorted nature of compact indexes
------------------------------------

                 Key: HIVE-2535
                 URL: https://issues.apache.org/jira/browse/HIVE-2535
             Project: Hive
          Issue Type: Improvement
            Reporter: Kevin Wilfong
            Assignee: Kevin Wilfong


Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.

To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2535) Use sorted nature of compact indexes

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141408#comment-13141408 ] 

jiraposter@reviews.apache.org commented on HIVE-2535:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/#review2987
-----------------------------------------------------------

trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinarySearchRecordReader.java
<https://reviews.apache.org/r/2605/#comment6699>

    do you need to override these functions ?
    They should be same as HiveRR

- namit

On 2011-10-29 01:39:50, Kevin Wilfong wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2605/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-10-29 01:39:50)
bq.  
bq.  
bq.  Review request for hive, Yongqiang He, Ning Zhang, and namit jain.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  The CompactIndexHandler determines if the reentrant query it creates is a candidate for using the fact the index is sorted (it has an appropriate number of non-partition conditions, and the query plan is of the form expected).  It sets the input format to HiveSortedInputFormat, and marks the FilterOperator for the non-partition condition.
bq.  
bq.  The HiveSortedInputFormat is extends HiveInputFormat, so its splits consist of data from a single file, and its record reader is HiveBinarySearchRecordReader.  HiveBinarySearchRecordReader starts by assuming it is performing a binary search.  It sets the appropriate flags in IOContext, which acts as the means of communication between the FilterOperators and the record reader.  The non-partition FilterOperator is responsible for executing a comparison between the value in the row and column of interest and the constant.  It also provides the type of the generic UDF.  It sets this data in the IOContext.  As long as the binary search continues the FilterOperators do not forward rows to the operators below them.  The record reader uses the comparison and the type of the generic UDF to execute a binary search on the underlying RCFile until it finds the block of interest, or determines that if any block is of interest it is the last one.  The search then proceeds linearly from the beginning of the identified block.  If ever in the binary search a problem occurs, like the comparison fails for some reason, a linear search begins from the beginning of the data which has yet to be eliminated.
bq.  
bq.  Regardless of whether or not a binary search is performed, the record reader attempts to end the linear search as soon as it can based on the comparison and the type of the generic UDF.
bq.  
bq.  
bq.  This addresses bug HIVE-2535.
bq.      https://issues.apache.org/jira/browse/HIVE-2535
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507 
bq.    trunk/conf/hive-default.xml 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinarySearchRecordReader.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSortedInputFormat.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java PRE-CREATION 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/2605/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  I added a test to verify the functionality of the HiveBinarySearchRecordReader.
bq.  
bq.  I also added a .q file to test that this returns the correct results when the underlying index is stored in an RCFile and when it is stored in as a text file, with all of the supported operators.
bq.  
bq.  I ran the .q files to verify they still pass.
bq.  
bq.  I ran some queries to verify there was a CPU benefit to doing this.  I saw as much as a 45% reduction in the total CPU used by the map reduce job to scan the index, for a large data set. 
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Kevin
bq.  
bq.

> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2535) Use sorted nature of compact indexes

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141425#comment-13141425 ] 

jiraposter@reviews.apache.org commented on HIVE-2535:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/#review2988
-----------------------------------------------------------

trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6700>

    findNonPartitionFilter modifies the conf of the FilterOperator for the index column to mark it as the one on which the data is sorted.

    See line 239

trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6703>

    the boolean useSorted used at line 186 in the if statement surrounding the call to this method is set to true iff the number of index columns being filtered on is 1 see line 277.

- Kevin

On 2011-10-29 01:39:50, Kevin Wilfong wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2605/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-10-29 01:39:50)
bq.  
bq.  
bq.  Review request for hive, Yongqiang He, Ning Zhang, and namit jain.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  The CompactIndexHandler determines if the reentrant query it creates is a candidate for using the fact the index is sorted (it has an appropriate number of non-partition conditions, and the query plan is of the form expected).  It sets the input format to HiveSortedInputFormat, and marks the FilterOperator for the non-partition condition.
bq.  
bq.  The HiveSortedInputFormat is extends HiveInputFormat, so its splits consist of data from a single file, and its record reader is HiveBinarySearchRecordReader.  HiveBinarySearchRecordReader starts by assuming it is performing a binary search.  It sets the appropriate flags in IOContext, which acts as the means of communication between the FilterOperators and the record reader.  The non-partition FilterOperator is responsible for executing a comparison between the value in the row and column of interest and the constant.  It also provides the type of the generic UDF.  It sets this data in the IOContext.  As long as the binary search continues the FilterOperators do not forward rows to the operators below them.  The record reader uses the comparison and the type of the generic UDF to execute a binary search on the underlying RCFile until it finds the block of interest, or determines that if any block is of interest it is the last one.  The search then proceeds linearly from the beginning of the identified block.  If ever in the binary search a problem occurs, like the comparison fails for some reason, a linear search begins from the beginning of the data which has yet to be eliminated.
bq.  
bq.  Regardless of whether or not a binary search is performed, the record reader attempts to end the linear search as soon as it can based on the comparison and the type of the generic UDF.
bq.  
bq.  
bq.  This addresses bug HIVE-2535.
bq.      https://issues.apache.org/jira/browse/HIVE-2535
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507 
bq.    trunk/conf/hive-default.xml 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinarySearchRecordReader.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSortedInputFormat.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java PRE-CREATION 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/2605/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  I added a test to verify the functionality of the HiveBinarySearchRecordReader.
bq.  
bq.  I also added a .q file to test that this returns the correct results when the underlying index is stored in an RCFile and when it is stored in as a text file, with all of the supported operators.
bq.  
bq.  I ran the .q files to verify they still pass.
bq.  
bq.  I ran some queries to verify there was a CPU benefit to doing this.  I saw as much as a 45% reduction in the total CPU used by the map reduce job to scan the index, for a large data set. 
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Kevin
bq.  
bq.

> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2535) Use sorted nature of compact indexes

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-2535:
---------------------------------

    Component/s: Query Processor
                 Indexing
         Labels: indexing performance  (was: )
    
> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>          Components: Indexing, Query Processor
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>              Labels: indexing, performance
>             Fix For: 0.8.0
>
>         Attachments: HIVE-2535.1.patch.txt, HIVE-2535.2.patch.txt, HIVE-2535.3.patch.txt, HIVE-2535.4.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2535) Use sorted nature of compact indexes

Posted by "Kevin Wilfong (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Wilfong updated HIVE-2535:
--------------------------------

    Attachment: HIVE-2535.3.patch.txt
    
> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt, HIVE-2535.2.patch.txt, HIVE-2535.3.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2535) Use sorted nature of compact indexes

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141788#comment-13141788 ] 

jiraposter@reviews.apache.org commented on HIVE-2535:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/
-----------------------------------------------------------

(Updated 2011-11-02 00:07:10.165338)


Review request for hive, Yongqiang He, Ning Zhang, and namit jain.


Changes
-------

Updated hive-defalut.xml with the new default value of hive.index.compact.binary.search


Summary
-------

The CompactIndexHandler determines if the reentrant query it creates is a candidate for using the fact the index is sorted (it has an appropriate number of non-partition conditions, and the query plan is of the form expected).  It sets the input format to HiveSortedInputFormat, and marks the FilterOperator for the non-partition condition.

The HiveSortedInputFormat is extends HiveInputFormat, so its splits consist of data from a single file, and its record reader is HiveBinarySearchRecordReader.  HiveBinarySearchRecordReader starts by assuming it is performing a binary search.  It sets the appropriate flags in IOContext, which acts as the means of communication between the FilterOperators and the record reader.  The non-partition FilterOperator is responsible for executing a comparison between the value in the row and column of interest and the constant.  It also provides the type of the generic UDF.  It sets this data in the IOContext.  As long as the binary search continues the FilterOperators do not forward rows to the operators below them.  The record reader uses the comparison and the type of the generic UDF to execute a binary search on the underlying RCFile until it finds the block of interest, or determines that if any block is of interest it is the last one.  The search then proceeds linearly from the beginning of the identified block.  If ever in the binary search a problem occurs, like the comparison fails for some reason, a linear search begins from the beginning of the data which has yet to be eliminated.

Regardless of whether or not a binary search is performed, the record reader attempts to end the linear search as soon as it can based on the comparison and the type of the generic UDF.


This addresses bug HIVE-2535.
    https://issues.apache.org/jira/browse/HIVE-2535


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507 
  trunk/conf/hive-default.xml 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinarySearchRecordReader.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/2605/diff


Testing
-------

I added a test to verify the functionality of the HiveBinarySearchRecordReader.

I also added a .q file to test that this returns the correct results when the underlying index is stored in an RCFile and when it is stored in as a text file, with all of the supported operators.

I ran the .q files to verify they still pass.

I ran some queries to verify there was a CPU benefit to doing this.  I saw as much as a 45% reduction in the total CPU used by the map reduce job to scan the index, for a large data set. 


Thanks,

Kevin


                
> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt, HIVE-2535.2.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2535) Use sorted nature of compact indexes

Posted by "Kevin Wilfong (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Wilfong updated HIVE-2535:
--------------------------------

    Attachment: HIVE-2535.2.patch.txt
    
> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt, HIVE-2535.2.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2535) Use sorted nature of compact indexes

Posted by "Kevin Wilfong (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Wilfong updated HIVE-2535:
--------------------------------

    Attachment: HIVE-2535.1.patch.txt
    
> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2535) Use sorted nature of compact indexes

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139052#comment-13139052 ] 

jiraposter@reviews.apache.org commented on HIVE-2535:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/
-----------------------------------------------------------

Review request for hive, Yongqiang He, Ning Zhang, and namit jain.


Summary
-------

The CompactIndexHandler determines if the reentrant query it creates is a candidate for using the fact the index is sorted (it has an appropriate number of non-partition conditions, and the query plan is of the form expected).  It sets the input format to HiveSortedInputFormat, and marks the FilterOperator for the non-partition condition.

The HiveSortedInputFormat is extends HiveInputFormat, so its splits consist of data from a single file, and its record reader is HiveBinarySearchRecordReader.  HiveBinarySearchRecordReader starts by assuming it is performing a binary search.  It sets the appropriate flags in IOContext, which acts as the means of communication between the FilterOperators and the record reader.  The non-partition FilterOperator is responsible for executing a comparison between the value in the row and column of interest and the constant.  It also provides the type of the generic UDF.  It sets this data in the IOContext.  As long as the binary search continues the FilterOperators do not forward rows to the operators below them.  The record reader uses the comparison and the type of the generic UDF to execute a binary search on the underlying RCFile until it finds the block of interest, or determines that if any block is of interest it is the last one.  The search then proceeds linearly from the beginning of the identified block.  If ever in the binary search a problem occurs, like the comparison fails for some reason, a linear search begins from the beginning of the data which has yet to be eliminated.

Regardless of whether or not a binary search is performed, the record reader attempts to end the linear search as soon as it can based on the comparison and the type of the generic UDF.


This addresses bug HIVE-2535.
    https://issues.apache.org/jira/browse/HIVE-2535


Diffs
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507 
  trunk/conf/hive-default.xml 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinarySearchRecordReader.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSortedInputFormat.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/2605/diff


Testing
-------

I added a test to verify the functionality of the HiveBinarySearchRecordReader.

I also added a .q file to test that this returns the correct results when the underlying index is stored in an RCFile and when it is stored in as a text file, with all of the supported operators.

I ran the .q files to verify they still pass.

I ran some queries to verify there was a CPU benefit to doing this.  I saw as much as a 45% reduction in the total CPU used by the map reduce job to scan the index, for a large data set. 


Thanks,

Kevin


                
> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2535) Use sorted nature of compact indexes

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141782#comment-13141782 ] 

jiraposter@reviews.apache.org commented on HIVE-2535:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/
-----------------------------------------------------------

(Updated 2011-11-01 23:56:58.062678)


Review request for hive, Yongqiang He, Ning Zhang, and namit jain.


Changes
-------

Thanks for the feedback Namit.  This diff should address all of your comments that I haven't already addressed with my own comments.

The most notable change in this diff is that the BinarySearchRecordReader can now be used for both HiveInputFormat and CombineHiveInputFormat.  Their respective record readers now inherit from BinarySearchRecordReader, and the CompactIndexHandler sets a variable representing whether or not the BinarySearchRecordReader should attempt to run its optimized search algorithm on the data.  If that variable is not set, the record readers behave exactly as they did before.

I updated the tests I added and verified they passed.  I also ran the tests most likely to be affected, notably any tests related to indexes, and verified they still passed.


Summary
-------

The CompactIndexHandler determines if the reentrant query it creates is a candidate for using the fact the index is sorted (it has an appropriate number of non-partition conditions, and the query plan is of the form expected).  It sets the input format to HiveSortedInputFormat, and marks the FilterOperator for the non-partition condition.

The HiveSortedInputFormat is extends HiveInputFormat, so its splits consist of data from a single file, and its record reader is HiveBinarySearchRecordReader.  HiveBinarySearchRecordReader starts by assuming it is performing a binary search.  It sets the appropriate flags in IOContext, which acts as the means of communication between the FilterOperators and the record reader.  The non-partition FilterOperator is responsible for executing a comparison between the value in the row and column of interest and the constant.  It also provides the type of the generic UDF.  It sets this data in the IOContext.  As long as the binary search continues the FilterOperators do not forward rows to the operators below them.  The record reader uses the comparison and the type of the generic UDF to execute a binary search on the underlying RCFile until it finds the block of interest, or determines that if any block is of interest it is the last one.  The search then proceeds linearly from the beginning of the identified block.  If ever in the binary search a problem occurs, like the comparison fails for some reason, a linear search begins from the beginning of the data which has yet to be eliminated.

Regardless of whether or not a binary search is performed, the record reader attempts to end the linear search as soon as it can based on the comparison and the type of the generic UDF.


This addresses bug HIVE-2535.
    https://issues.apache.org/jira/browse/HIVE-2535


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507 
  trunk/conf/hive-default.xml 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinarySearchRecordReader.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/2605/diff


Testing
-------

I added a test to verify the functionality of the HiveBinarySearchRecordReader.

I also added a .q file to test that this returns the correct results when the underlying index is stored in an RCFile and when it is stored in as a text file, with all of the supported operators.

I ran the .q files to verify they still pass.

I ran some queries to verify there was a CPU benefit to doing this.  I saw as much as a 45% reduction in the total CPU used by the map reduce job to scan the index, for a large data set. 


Thanks,

Kevin


                
> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2535) Use sorted nature of compact indexes

Posted by "Namit Jain (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-2535:
-----------------------------

    Status: Patch Available  (was: Open)

changed to patch available
                
> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2535) Use sorted nature of compact indexes

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149864#comment-13149864 ] 

jiraposter@reviews.apache.org commented on HIVE-2535:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/
-----------------------------------------------------------

(Updated 2011-11-14 20:10:27.544678)


Review request for hive, Yongqiang He, Ning Zhang, and namit jain.


Changes
-------

This diff is in response to Yongqiang's comments.

He pointed out that if the ExprNode in a FilterOperator is a tree, rather than a single operator, the optimizations will not be applied.  This was fixed by marking the FilterOperator, and the ExprNode in the tree, where the index column is compared.  When the FilterOperator needs to execute a comparison, it traverses the tree of ExprNodes until it identifies the marked ExprNode which has the context necessary to perform the comparison.

There was also an issue where CombineHiveRecordReaders were not always being marked as sorted when they should.  This was fixed by marking them as sorted as part of the initIOContext call, instead of as part of the HiveContextAwareRecordReader constructor.

Instead of determining if the input format class supports making use of sorted input by string comparison of the class name in the CompactIndexHandler, I check whether or not the class extends HiveInputFormat.  Currently this is sufficient as all of the InputFormats introduced in Hive extend this, and I ensured that all such InputFormats do indeed support making use of sorted input.


Summary
-------

The CompactIndexHandler determines if the reentrant query it creates is a candidate for using the fact the index is sorted (it has an appropriate number of non-partition conditions, and the query plan is of the form expected).  It sets the input format to HiveSortedInputFormat, and marks the FilterOperator for the non-partition condition.

The HiveSortedInputFormat is extends HiveInputFormat, so its splits consist of data from a single file, and its record reader is HiveBinarySearchRecordReader.  HiveBinarySearchRecordReader starts by assuming it is performing a binary search.  It sets the appropriate flags in IOContext, which acts as the means of communication between the FilterOperators and the record reader.  The non-partition FilterOperator is responsible for executing a comparison between the value in the row and column of interest and the constant.  It also provides the type of the generic UDF.  It sets this data in the IOContext.  As long as the binary search continues the FilterOperators do not forward rows to the operators below them.  The record reader uses the comparison and the type of the generic UDF to execute a binary search on the underlying RCFile until it finds the block of interest, or determines that if any block is of interest it is the last one.  The search then proceeds linearly from the beginning of the identified block.  If ever in the binary search a problem occurs, like the comparison fails for some reason, a linear search begins from the beginning of the data which has yet to be eliminated.

Regardless of whether or not a binary search is performed, the record reader attempts to end the linear search as soon as it can based on the comparison and the type of the generic UDF.


This addresses bug HIVE-2535.
    https://issues.apache.org/jira/browse/HIVE-2535


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507 
  trunk/conf/hive-default.xml 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/BucketizedHiveRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeGenericFuncDesc.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/2605/diff


Testing
-------

I added a test to verify the functionality of the HiveBinarySearchRecordReader.

I also added a .q file to test that this returns the correct results when the underlying index is stored in an RCFile and when it is stored in as a text file, with all of the supported operators.

I ran the .q files to verify they still pass.

I ran some queries to verify there was a CPU benefit to doing this.  I saw as much as a 45% reduction in the total CPU used by the map reduce job to scan the index, for a large data set. 


Thanks,

Kevin


                
> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt, HIVE-2535.2.patch.txt, HIVE-2535.3.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2535) Use sorted nature of compact indexes

Posted by "Kevin Wilfong (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Wilfong updated HIVE-2535:
--------------------------------

    Attachment: HIVE-2535.4.patch.txt
    
> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt, HIVE-2535.2.patch.txt, HIVE-2535.3.patch.txt, HIVE-2535.4.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2535) Use sorted nature of compact indexes

Posted by "He Yongqiang (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-2535:
-------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

committed, thanks Kevin!
                
> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt, HIVE-2535.2.patch.txt, HIVE-2535.3.patch.txt, HIVE-2535.4.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2535) Use sorted nature of compact indexes

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151100#comment-13151100 ] 

Hudson commented on HIVE-2535:
------------------------------

Integrated in Hive-trunk-h0.21 #1089 (See [https://builds.apache.org/job/Hive-trunk-h0.21/1089/])
    HIVE-2535: Use sorted nature of compact indexes (Kevin Wilfong via He Yongqiang)

heyongqiang : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1202525
Files : 
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/BucketizedHiveRecordReader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveRecordReader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeGenericFuncDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java
* /hive/trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q
* /hive/trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out

                
> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt, HIVE-2535.2.patch.txt, HIVE-2535.3.patch.txt, HIVE-2535.4.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2535) Use sorted nature of compact indexes

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143656#comment-13143656 ] 

jiraposter@reviews.apache.org commented on HIVE-2535:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/
-----------------------------------------------------------

(Updated 2011-11-04 00:42:51.180303)


Review request for hive, Yongqiang He, Ning Zhang, and namit jain.


Changes
-------

As Namit suggested, I merged HiveBinarySearchRecordReader and HiveContextAwareRecordReader into a single class, as it seemed the binary search is closely tied to the context.

I also realized while doing this that I could update BucketizedHiveRecordReader to support binary search as well.

I updated the tests, reran the tests, and also ran tests to verify BucketizedHiveRecordReader was not broken.


Summary
-------

The CompactIndexHandler determines if the reentrant query it creates is a candidate for using the fact the index is sorted (it has an appropriate number of non-partition conditions, and the query plan is of the form expected).  It sets the input format to HiveSortedInputFormat, and marks the FilterOperator for the non-partition condition.

The HiveSortedInputFormat is extends HiveInputFormat, so its splits consist of data from a single file, and its record reader is HiveBinarySearchRecordReader.  HiveBinarySearchRecordReader starts by assuming it is performing a binary search.  It sets the appropriate flags in IOContext, which acts as the means of communication between the FilterOperators and the record reader.  The non-partition FilterOperator is responsible for executing a comparison between the value in the row and column of interest and the constant.  It also provides the type of the generic UDF.  It sets this data in the IOContext.  As long as the binary search continues the FilterOperators do not forward rows to the operators below them.  The record reader uses the comparison and the type of the generic UDF to execute a binary search on the underlying RCFile until it finds the block of interest, or determines that if any block is of interest it is the last one.  The search then proceeds linearly from the beginning of the identified block.  If ever in the binary search a problem occurs, like the comparison fails for some reason, a linear search begins from the beginning of the data which has yet to be eliminated.

Regardless of whether or not a binary search is performed, the record reader attempts to end the linear search as soon as it can based on the comparison and the type of the generic UDF.


This addresses bug HIVE-2535.
    https://issues.apache.org/jira/browse/HIVE-2535


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507 
  trunk/conf/hive-default.xml 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/BucketizedHiveRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1183507 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/2605/diff


Testing
-------

I added a test to verify the functionality of the HiveBinarySearchRecordReader.

I also added a .q file to test that this returns the correct results when the underlying index is stored in an RCFile and when it is stored in as a text file, with all of the supported operators.

I ran the .q files to verify they still pass.

I ran some queries to verify there was a CPU benefit to doing this.  I saw as much as a 45% reduction in the total CPU used by the map reduce job to scan the index, for a large data set. 


Thanks,

Kevin


                
> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt, HIVE-2535.2.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2535) Use sorted nature of compact indexes

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141404#comment-13141404 ] 

jiraposter@reviews.apache.org commented on HIVE-2535:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/#review2974
-----------------------------------------------------------

trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
<https://reviews.apache.org/r/2605/#comment6684>

    The default can be true

trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java
<https://reviews.apache.org/r/2605/#comment6692>

    nit: spelling 

trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java
<https://reviews.apache.org/r/2605/#comment6685>

    More comments here.
    It would be useful to describe when is a binary search
    performed.

trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6694>

    This should not be hard-coded.
    If user wanted HiveInputFormat, it should be 
    HiveSortedInputFormat and same for CombineHiveSortedInputFormat.

    Do we need a new class, or can sorted be a 
    property of input format ? Then, it should automatcally
    work for both hiveIF and combinehiveIF

trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6695>

    use the term index column instead of non-partition column.

    Who is using the function findNonPartitionFilterWork.
    It is not modifying any internal structure, and the 
    return value is not used

trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6696>

    I am confused - what if the filter contains multiple
    non partition column predicates ?

trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSortedInputFormat.java
<https://reviews.apache.org/r/2605/#comment6698>

    As mentioned before, it would be good if this also works with CombineHiveIF

- namit

On 2011-10-29 01:39:50, Kevin Wilfong wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2605/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-10-29 01:39:50)
bq.  
bq.  
bq.  Review request for hive, Yongqiang He, Ning Zhang, and namit jain.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  The CompactIndexHandler determines if the reentrant query it creates is a candidate for using the fact the index is sorted (it has an appropriate number of non-partition conditions, and the query plan is of the form expected).  It sets the input format to HiveSortedInputFormat, and marks the FilterOperator for the non-partition condition.
bq.  
bq.  The HiveSortedInputFormat is extends HiveInputFormat, so its splits consist of data from a single file, and its record reader is HiveBinarySearchRecordReader.  HiveBinarySearchRecordReader starts by assuming it is performing a binary search.  It sets the appropriate flags in IOContext, which acts as the means of communication between the FilterOperators and the record reader.  The non-partition FilterOperator is responsible for executing a comparison between the value in the row and column of interest and the constant.  It also provides the type of the generic UDF.  It sets this data in the IOContext.  As long as the binary search continues the FilterOperators do not forward rows to the operators below them.  The record reader uses the comparison and the type of the generic UDF to execute a binary search on the underlying RCFile until it finds the block of interest, or determines that if any block is of interest it is the last one.  The search then proceeds linearly from the beginning of the identified block.  If ever in the binary search a problem occurs, like the comparison fails for some reason, a linear search begins from the beginning of the data which has yet to be eliminated.
bq.  
bq.  Regardless of whether or not a binary search is performed, the record reader attempts to end the linear search as soon as it can based on the comparison and the type of the generic UDF.
bq.  
bq.  
bq.  This addresses bug HIVE-2535.
bq.      https://issues.apache.org/jira/browse/HIVE-2535
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507 
bq.    trunk/conf/hive-default.xml 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinarySearchRecordReader.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSortedInputFormat.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java PRE-CREATION 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/2605/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  I added a test to verify the functionality of the HiveBinarySearchRecordReader.
bq.  
bq.  I also added a .q file to test that this returns the correct results when the underlying index is stored in an RCFile and when it is stored in as a text file, with all of the supported operators.
bq.  
bq.  I ran the .q files to verify they still pass.
bq.  
bq.  I ran some queries to verify there was a CPU benefit to doing this.  I saw as much as a 45% reduction in the total CPU used by the map reduce job to scan the index, for a large data set. 
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Kevin
bq.  
bq.

> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2535) Use sorted nature of compact indexes

Posted by "He Yongqiang (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150712#comment-13150712 ] 

He Yongqiang commented on HIVE-2535:
------------------------------------

looks good, running tests
                
> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt, HIVE-2535.2.patch.txt, HIVE-2535.3.patch.txt, HIVE-2535.4.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate being used to access the index consists of only one non-partition condition using one of the operators >,>=,<,<=,= we could use a binary search (if necessary) to find the block to begin scanning for unfiltered rows, and we could use the result of comparing the value in the column with the constant (this is necessarily the form of a predicate which is optimized using an index) to determine when we have found all the rows which will be unfiltered.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira