Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2022/02/12 15:33:16 UTC

[GitHub] [parquet-mr] rshkv opened a new pull request #946: PARQUET-2120: CLI dictionary command should not fail on missing dictionary pages

rshkv opened a new pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946


   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
     - https://issues.apache.org/jira/browse/PARQUET-XXX
     - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and classes in the PR contain Javadoc that explains what they do
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [parquet-mr] rshkv commented on a change in pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages

Posted by GitBox <gi...@apache.org>.
rshkv commented on a change in pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#discussion_r805173968



##########
File path: parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java
##########
@@ -75,40 +75,12 @@ public int run() throws IOException {
       while ((dictionaryReader = reader.getNextDictionaryReader()) != null) {
         DictionaryPage page = dictionaryReader.readDictionaryPage(descriptor);
 
-        Dictionary dict = page.getEncoding().initDictionary(descriptor, page);
-
-        console.info("\nRow group {} dictionary for \"{}\":", rowGroup, column, page.getCompressedSize());
-        for (int i = 0; i <= dict.getMaxId(); i += 1) {
-          switch(type.getPrimitiveTypeName()) {
-            case BINARY:
-              if (type.getLogicalTypeAnnotation() instanceof LogicalTypeAnnotation.StringLogicalTypeAnnotation) {
-                console.info("{}: {}", String.format("%6d", i),
-                    Util.humanReadable(dict.decodeToBinary(i).toStringUsingUTF8(), 70));
-              } else {
-                console.info("{}: {}", String.format("%6d", i),
-                    Util.humanReadable(dict.decodeToBinary(i).getBytesUnsafe(), 70));
-              }
-              break;
-            case INT32:
-              console.info("{}: {}", String.format("%6d", i),
-                dict.decodeToInt(i));
-              break;
-            case INT64:
-              console.info("{}: {}", String.format("%6d", i),
-                  dict.decodeToLong(i));
-              break;
-            case FLOAT:
-              console.info("{}: {}", String.format("%6d", i),
-                  dict.decodeToFloat(i));
-              break;
-            case DOUBLE:
-              console.info("{}: {}", String.format("%6d", i),
-                  dict.decodeToDouble(i));
-              break;
-            default:
-              throw new IllegalArgumentException(
-                  "Unknown dictionary type: " + type.getPrimitiveTypeName());
-          }
+        if (page != null) {
+          console.info("\nRow group {} dictionary for \"{}\":", rowGroup, column);
+          Dictionary dict = page.getEncoding().initDictionary(descriptor, page);
+          printDictionary(dict, type);
+        } else {
+          console.info("\nRow group {} has no dictionary for \"{}\"", rowGroup, column);

Review comment:
       For a file mixing pages with and without dictionary encoding, the output would look like this, for example:
   ```
   Row group 0 has no dictionary for "col"
   
   Row group 1 dictionary for "col":
        0: "b"
        1: "c"
   ```
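
The branch in the hunk above can be sketched without the Parquet classes. This is a minimal illustration only: `MissingDictionarySketch`, `describe`, and the list-of-lists input are hypothetical stand-ins (a `null` entry models a row group for which `readDictionaryPage` returned no page), and the blank separator lines of the real output are omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MissingDictionarySketch {

  // Hypothetical stand-in: one entry per row group, holding that row group's
  // dictionary values for the column, or null when there is no dictionary page.
  static List<String> describe(String column, List<List<String>> dictionariesPerRowGroup) {
    List<String> lines = new ArrayList<>();
    int rowGroup = 0;
    for (List<String> dict : dictionariesPerRowGroup) {
      if (dict != null) {
        // Dictionary present: print a header followed by each id and value.
        lines.add("Row group " + rowGroup + " dictionary for \"" + column + "\":");
        for (int i = 0; i < dict.size(); i += 1) {
          lines.add(String.format("%6d: \"%s\"", i, dict.get(i)));
        }
      } else {
        // No dictionary: report it instead of dereferencing null.
        lines.add("Row group " + rowGroup + " has no dictionary for \"" + column + "\"");
      }
      rowGroup += 1;
    }
    return lines;
  }

  public static void main(String[] args) {
    // Row group 0 has no dictionary; row group 1 has the values "b" and "c".
    List<List<String>> perRowGroup = Arrays.asList(null, Arrays.asList("b", "c"));
    for (String line : describe("col", perRowGroup)) {
      System.out.println(line);
    }
  }
}
```

The point of the sketch is only the control flow: the null case is handled per row group, so one missing dictionary no longer aborts the listing for the remaining row groups.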







[GitHub] [parquet-mr] shangxinli commented on pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages

Posted by GitBox <gi...@apache.org>.
shangxinli commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1039360003


   Thanks for working on it! Can you squash the commits?





[GitHub] [parquet-mr] rshkv commented on pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages

Posted by GitBox <gi...@apache.org>.
rshkv commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1040188533


   Done!





[GitHub] [parquet-mr] rshkv commented on pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages

Posted by GitBox <gi...@apache.org>.
rshkv commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1051132018


   @shangxinli, thanks for reviewing!





[GitHub] [parquet-mr] rshkv commented on pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages

Posted by GitBox <gi...@apache.org>.
rshkv commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1047965546


   @shangxinli anything else I can do here?





[GitHub] [parquet-mr] rshkv commented on a change in pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages

Posted by GitBox <gi...@apache.org>.
rshkv commented on a change in pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#discussion_r805173716



##########
File path: parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java
##########
@@ -75,40 +75,12 @@ public int run() throws IOException {
       while ((dictionaryReader = reader.getNextDictionaryReader()) != null) {
         DictionaryPage page = dictionaryReader.readDictionaryPage(descriptor);
 
-        Dictionary dict = page.getEncoding().initDictionary(descriptor, page);
-
-        console.info("\nRow group {} dictionary for \"{}\":", rowGroup, column, page.getCompressedSize());
-        for (int i = 0; i <= dict.getMaxId(); i += 1) {
-          switch(type.getPrimitiveTypeName()) {
-            case BINARY:
-              if (type.getLogicalTypeAnnotation() instanceof LogicalTypeAnnotation.StringLogicalTypeAnnotation) {
-                console.info("{}: {}", String.format("%6d", i),
-                    Util.humanReadable(dict.decodeToBinary(i).toStringUsingUTF8(), 70));
-              } else {
-                console.info("{}: {}", String.format("%6d", i),
-                    Util.humanReadable(dict.decodeToBinary(i).getBytesUnsafe(), 70));
-              }
-              break;
-            case INT32:
-              console.info("{}: {}", String.format("%6d", i),
-                dict.decodeToInt(i));
-              break;
-            case INT64:
-              console.info("{}: {}", String.format("%6d", i),
-                  dict.decodeToLong(i));
-              break;
-            case FLOAT:
-              console.info("{}: {}", String.format("%6d", i),
-                  dict.decodeToFloat(i));
-              break;
-            case DOUBLE:
-              console.info("{}: {}", String.format("%6d", i),
-                  dict.decodeToDouble(i));
-              break;
-            default:
-              throw new IllegalArgumentException(
-                  "Unknown dictionary type: " + type.getPrimitiveTypeName());
-          }
+        if (page != null) {

Review comment:
       This check is the crux of the change.











[GitHub] [parquet-mr] rshkv commented on a change in pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages

Posted by GitBox <gi...@apache.org>.
rshkv commented on a change in pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#discussion_r805174016



##########
File path: parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java
##########
@@ -122,6 +94,41 @@ public int run() throws IOException {
     return 0;
   }
 
+  private void printDictionary(Dictionary dict, PrimitiveType type) {

Review comment:
       This is just a copy-paste of the `for` block above.













[GitHub] [parquet-mr] shangxinli commented on pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages

Posted by GitBox <gi...@apache.org>.
shangxinli commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1048018749


   I will have a look sometime this week.





[GitHub] [parquet-mr] shangxinli merged pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages

Posted by GitBox <gi...@apache.org>.
shangxinli merged pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946


   





[GitHub] [parquet-mr] rshkv commented on a change in pull request #946: PARQUET-2120: Dictionary command should handle missing dictionary pages

Posted by GitBox <gi...@apache.org>.
rshkv commented on a change in pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#discussion_r805174327



##########
File path: parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java
##########
@@ -75,40 +75,12 @@ public int run() throws IOException {
       while ((dictionaryReader = reader.getNextDictionaryReader()) != null) {
         DictionaryPage page = dictionaryReader.readDictionaryPage(descriptor);
 
-        Dictionary dict = page.getEncoding().initDictionary(descriptor, page);
-
-        console.info("\nRow group {} dictionary for \"{}\":", rowGroup, column, page.getCompressedSize());

Review comment:
       I removed the `page.getCompressedSize()` argument here as the log didn't have enough placeholders to display it in the first place.
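
To illustrate the placeholder mismatch described in this comment: the removed call passed three arguments to a message with only two `{}` placeholders. The sketch below is a simplified stand-in for `{}`-style substitution, assuming the console logger follows SLF4J-like semantics where surplus arguments are simply not rendered; `PlaceholderSketch` and `format` are hypothetical names, not the logging library's code, and `1234` is a made-up size.

```java
public class PlaceholderSketch {

  // Simplified "{}" substitution: each argument fills the next placeholder;
  // once the placeholders run out, remaining arguments are never rendered.
  static String format(String template, Object... args) {
    StringBuilder out = new StringBuilder();
    int from = 0;
    int argIndex = 0;
    while (argIndex < args.length) {
      int at = template.indexOf("{}", from);
      if (at < 0) {
        break; // no placeholder left for this argument
      }
      out.append(template, from, at).append(args[argIndex]);
      from = at + 2;
      argIndex += 1;
    }
    out.append(template.substring(from));
    return out.toString();
  }

  public static void main(String[] args) {
    // Two placeholders, three arguments: the third value does not appear.
    System.out.println(format("Row group {} dictionary for \"{}\":", 0, "col", 1234));
  }
}
```

Under that assumption, dropping the third argument changes nothing in the printed line; it only removes a value that was never displayed.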



