Posted to dev@parquet.apache.org by "Gidon Gershinsky (Jira)" <ji...@apache.org> on 2022/05/19 06:03:00 UTC

[jira] [Updated] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding

     [ https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gidon Gershinsky updated PARQUET-2120:
--------------------------------------
    Fix Version/s: 1.12.3

> parquet-cli dictionary command fails on pages without dictionary encoding
> -------------------------------------------------------------------------
>
>                 Key: PARQUET-2120
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2120
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cli
>    Affects Versions: 1.12.2
>            Reporter: Willi Raschkowski
>            Priority: Minor
>             Fix For: 1.12.3
>
>
> parquet-cli's {{dictionary}} command fails with an NPE if a page does not have dictionary encoding:
> {code}
> $ parquet dictionary --column col a-b-c.snappy.parquet                
> Unknown error
> java.lang.NullPointerException: Cannot invoke "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" is null
> 	at org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78)
> 	at org.apache.parquet.cli.Main.run(Main.java:155)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> 	at org.apache.parquet.cli.Main.main(Main.java:185)
> $ parquet meta a-b-c.snappy.parquet      
> ...
> Row group 0:  count: 1  46.00 B records  start: 4  total: 46 B
> --------------------------------------------------------------------------------
>      type      encodings count     avg size   nulls   min / max
> col  BINARY    S   _     1         46.00 B    0       "a" / "a"
> Row group 1:  count: 200  0.34 B records  start: 50  total: 69 B
> --------------------------------------------------------------------------------
>      type      encodings count     avg size   nulls   min / max
> col  BINARY    S _ R     200       0.34 B     0       "b" / "c"
> {code}
> (Note the missing {{R}}, i.e. dictionary encoding, in row group 0.)
> Someone familiar with Parquet might guess from the NPE that there is no dictionary encoding. But for files that mix pages with and without dictionary encoding (as above), the command fails before it reaches the pages that do have dictionaries.
> The problem is that [this line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76] assumes {{readDictionaryPage}} always returns a page and does not handle the case where it returns {{null}}.
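
The null handling described in the report could look roughly like the sketch below. It is not the actual patch for PARQUET-2120. DictPage, DictionaryReader and describe are hypothetical stand-ins, so the example runs without parquet-mr on the classpath, and the "PLAIN_DICTIONARY" value and entry count are illustrative. Only the behaviour that readDictionaryPage can return null is taken from the report.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: DictPage and DictionaryReader are hypothetical stand-ins,
// not the parquet-mr classes.
class DictionaryNullCheckSketch {

    /** Stand-in for a dictionary page: just an encoding name and an entry count. */
    static final class DictPage {
        final String encoding;
        final int entries;

        DictPage(String encoding, int entries) {
            this.encoding = encoding;
            this.entries = entries;
        }
    }

    /** Stand-in for a per-row-group reader; returns null when there is no dictionary. */
    interface DictionaryReader {
        DictPage readDictionaryPage(String column);
    }

    /** Describes each row group, reporting a missing dictionary instead of dereferencing null. */
    static List<String> describe(List<DictionaryReader> rowGroups, String column) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < rowGroups.size(); i++) {
            DictPage page = rowGroups.get(i).readDictionaryPage(column);
            if (page == null) {
                // The guard the report asks for: note the gap and move on.
                lines.add("Row group " + i + ": no dictionary for " + column);
                continue;
            }
            lines.add("Row group " + i + ": dictionary for " + column
                + ", encoding " + page.encoding + ", " + page.entries + " entries");
        }
        return lines;
    }

    public static void main(String[] args) {
        List<DictionaryReader> rowGroups = new ArrayList<>();
        rowGroups.add(column -> null);                                // like row group 0 in the report
        rowGroups.add(column -> new DictPage("PLAIN_DICTIONARY", 2)); // like row group 1 in the report
        for (String line : describe(rowGroups, "col")) {
            System.out.println(line);
        }
    }
}
```

With a guard like this, a row group without a dictionary produces a message rather than an exception, so later row groups that do have dictionaries are still shown.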



--
This message was sent by Atlassian Jira
(v8.20.7#820007)