You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2020/12/03 11:04:32 UTC

[GitHub] [hive] asinkovits opened a new pull request #1730: HIVE-24475: Generalize fixacidkeyindex utility

asinkovits opened a new pull request #1730:
URL: https://github.com/apache/hive/pull/1730


   Change-Id: I6b9f501f2804aa5c71ee03cadcea3ec24fdd6005
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://cwiki.apache.org/confluence/display/Hive/HowToContribute
     2. Ensure that you have created an issue on the Hive project JIRA: https://issues.apache.org/jira/projects/HIVE/summary
     3. Ensure you have added or run the appropriate tests for your PR: 
     4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]HIVE-XXXXX:  Your PR title ...'.
     5. Be sure to keep the PR description updated to reflect all changes.
     6. Please write your PR title to summarize what this PR proposes.
     7. If possible, provide a concise example to reproduce the issue for a faster review.
   
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   There is a utility in hive which can validate/fix corrupted hive.acid.key.index.
   hive --service fixacidkeyindex
   Unfortunately it is only tailored for a specific problem (https://issues.apache.org/jira/browse/HIVE-18907), instead of generally validating and recovering the hive.acid.key.index from the stripe data itself.
   The PR generalize the validation/recovery by checking/rebuilding the hive.acid.key.index from the data itself.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   The name fixacidkeyindex suggests that this is a general validation/fixer tool for the index.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description, screenshot and/or a reproducable example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Hive versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   hiveconf params can now be specified to the tool, to minimize the diff in the recovered file. (E.g: --hiveconf hive.exec.orc.default.block.padding=false)
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was diffic--ult to add.
   -->
   Unit test and manual testing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] asinkovits commented on a change in pull request #1730: HIVE-24475: Generalize fixacidkeyindex utility

Posted by GitBox <gi...@apache.org>.
asinkovits commented on a change in pull request #1730:
URL: https://github.com/apache/hive/pull/1730#discussion_r537671312



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/FixAcidKeyIndex.java
##########
@@ -163,11 +144,52 @@ static void checkFile(Configuration conf, Path inputPath) throws IOException {
       return;
     }
 
-    boolean validIndex = isAcidKeyIndexValid(reader);
+    AcidKeyIndexValidationResult validationResult = validate(conf, inputPath);
+    boolean validIndex = validationResult.isValid;
     System.out.println("Checking " + inputPath + " - acid key index is " +
         (validIndex ? "valid" : "invalid"));
   }
 
+  public static AcidKeyIndexValidationResult validate(Configuration conf, Path inputPath) throws IOException {
+    AcidKeyIndexValidationResult result = new AcidKeyIndexValidationResult();
+    FileSystem fs = inputPath.getFileSystem(conf);
+    Reader reader = OrcFile.createReader(fs, inputPath);
+    List<StripeInformation> stripes = reader.getStripes();
+    RecordIdentifier[] keyIndex = OrcRecordUpdater.parseKeyIndex(reader);
+    StructObjectInspector soi = (StructObjectInspector) reader.getObjectInspector();
+    // struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint
+
+    long rowsProcessed = 0;
+    try (RecordReader rr = reader.rows()) {
+      for(int i=0; i<stripes.size(); i++) {
+        rowsProcessed += stripes.get(i).getNumberOfRows();
+        rr.seekToRow(rowsProcessed-1);
+        OrcStruct row = (OrcStruct) rr.next(null);
+
+        List<? extends StructField> structFields = soi.getAllStructFieldRefs();
+
+        StructField transactionField = structFields.get(1);
+        StructField bucketField = structFields.get(2);

Review comment:
       Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvargacl commented on a change in pull request #1730: HIVE-24475: Generalize fixacidkeyindex utility

Posted by GitBox <gi...@apache.org>.
pvargacl commented on a change in pull request #1730:
URL: https://github.com/apache/hive/pull/1730#discussion_r535204663



##########
File path: ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestFixAcidKeyIndex.java
##########
@@ -243,6 +233,12 @@ public void testInvalidKeyIndex() throws Exception {
     checkInvalidKeyIndex(testFilePath);
     // Try fixing, this should result in new fixed file.
     fixInvalidIndex(testFilePath);
+
+    // Multiple stripes
+    createTestAcidFile(testFilePath, 12000, new FaultyKeyIndexBuilder());
+    checkInvalidKeyIndex(testFilePath);
+    // Try fixing, this should result in new fixed file.
+    fixInvalidIndex(testFilePath);

Review comment:
       Shouldn't this check if the index was actually corrected in the new file? It just checks the output of the tool.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] asinkovits commented on a change in pull request #1730: HIVE-24475: Generalize fixacidkeyindex utility

Posted by GitBox <gi...@apache.org>.
asinkovits commented on a change in pull request #1730:
URL: https://github.com/apache/hive/pull/1730#discussion_r535223946



##########
File path: ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestFixAcidKeyIndex.java
##########
@@ -243,6 +233,12 @@ public void testInvalidKeyIndex() throws Exception {
     checkInvalidKeyIndex(testFilePath);
     // Try fixing, this should result in new fixed file.
     fixInvalidIndex(testFilePath);
+
+    // Multiple stripes
+    createTestAcidFile(testFilePath, 12000, new FaultyKeyIndexBuilder());
+    checkInvalidKeyIndex(testFilePath);
+    // Try fixing, this should result in new fixed file.
+    fixInvalidIndex(testFilePath);

Review comment:
       There is a validation in the tool itself after the recovery, so if the index was still invalid, the output would be different.
   
   https://github.com/asinkovits/hive/blob/HIVE-24475/ql/src/java/org/apache/hadoop/hive/ql/io/orc/FixAcidKeyIndex.java#L265
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvargacl commented on pull request #1730: HIVE-24475: Generalize fixacidkeyindex utility

Posted by GitBox <gi...@apache.org>.
pvargacl commented on pull request #1730:
URL: https://github.com/apache/hive/pull/1730#issuecomment-738722498


   LGTM +1 (non-binding)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] maheshk114 commented on pull request #1730: HIVE-24475: Generalize fixacidkeyindex utility

Posted by GitBox <gi...@apache.org>.
maheshk114 commented on pull request #1730:
URL: https://github.com/apache/hive/pull/1730#issuecomment-739319574


   +1 LGTM 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter merged pull request #1730: HIVE-24475: Generalize fixacidkeyindex utility

Posted by GitBox <gi...@apache.org>.
lcspinter merged pull request #1730:
URL: https://github.com/apache/hive/pull/1730


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on pull request #1730: HIVE-24475: Generalize fixacidkeyindex utility

Posted by GitBox <gi...@apache.org>.
lcspinter commented on pull request #1730:
URL: https://github.com/apache/hive/pull/1730#issuecomment-741629522


   Merged into master. Thanks for the patch @asinkovits and for the review @pvargacl and @maheshk114 .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] maheshk114 commented on a change in pull request #1730: HIVE-24475: Generalize fixacidkeyindex utility

Posted by GitBox <gi...@apache.org>.
maheshk114 commented on a change in pull request #1730:
URL: https://github.com/apache/hive/pull/1730#discussion_r536821113



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/FixAcidKeyIndex.java
##########
@@ -163,11 +144,52 @@ static void checkFile(Configuration conf, Path inputPath) throws IOException {
       return;
     }
 
-    boolean validIndex = isAcidKeyIndexValid(reader);
+    AcidKeyIndexValidationResult validationResult = validate(conf, inputPath);
+    boolean validIndex = validationResult.isValid;
     System.out.println("Checking " + inputPath + " - acid key index is " +
         (validIndex ? "valid" : "invalid"));
   }
 
+  public static AcidKeyIndexValidationResult validate(Configuration conf, Path inputPath) throws IOException {
+    AcidKeyIndexValidationResult result = new AcidKeyIndexValidationResult();
+    FileSystem fs = inputPath.getFileSystem(conf);
+    Reader reader = OrcFile.createReader(fs, inputPath);
+    List<StripeInformation> stripes = reader.getStripes();
+    RecordIdentifier[] keyIndex = OrcRecordUpdater.parseKeyIndex(reader);
+    StructObjectInspector soi = (StructObjectInspector) reader.getObjectInspector();
+    // struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint
+
+    long rowsProcessed = 0;
+    try (RecordReader rr = reader.rows()) {
+      for(int i=0; i<stripes.size(); i++) {
+        rowsProcessed += stripes.get(i).getNumberOfRows();
+        rr.seekToRow(rowsProcessed-1);
+        OrcStruct row = (OrcStruct) rr.next(null);
+
+        List<? extends StructField> structFields = soi.getAllStructFieldRefs();
+
+        StructField transactionField = structFields.get(1);
+        StructField bucketField = structFields.get(2);

Review comment:
       Can this be moved out of the loop ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvargacl commented on a change in pull request #1730: HIVE-24475: Generalize fixacidkeyindex utility

Posted by GitBox <gi...@apache.org>.
pvargacl commented on a change in pull request #1730:
URL: https://github.com/apache/hive/pull/1730#discussion_r535251574



##########
File path: ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestFixAcidKeyIndex.java
##########
@@ -243,6 +233,12 @@ public void testInvalidKeyIndex() throws Exception {
     checkInvalidKeyIndex(testFilePath);
     // Try fixing, this should result in new fixed file.
     fixInvalidIndex(testFilePath);
+
+    // Multiple stripes
+    createTestAcidFile(testFilePath, 12000, new FaultyKeyIndexBuilder());
+    checkInvalidKeyIndex(testFilePath);
+    // Try fixing, this should result in new fixed file.
+    fixInvalidIndex(testFilePath);

Review comment:
       Ah ok, missed that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org