Posted to jira@arrow.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/09/14 04:30:00 UTC

[jira] [Updated] (ARROW-13487) [C++][Parquet] Reading dict pages is not reading all values?

     [ https://issues.apache.org/jira/browse/ARROW-13487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-13487:
-----------------------------------
    Labels: pull-request-available  (was: )

> [C++][Parquet] Reading dict pages is not reading all values?
> ------------------------------------------------------------
>
>                 Key: ARROW-13487
>                 URL: https://issues.apache.org/jira/browse/ARROW-13487
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>            Reporter: Jorge Leitão
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: generated_dictionary.parquet
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> While round-tripping dictionary-encoded arrays through dictionary-encoded Parquet files in arrow2, I have been unable to get pyarrow to read all values from the dictionary page. This contrasts with (py)spark, which can read them.
> Attached to this issue is a Parquet file generated with Rust's arrow2: I read the IPC "generated_dictionary" file and write it to Parquet (v1) with dictionary encoding, i.e. two pages, one with the values and the other with the indices.
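> For reference, an equivalent round trip written against pyarrow alone would look roughly like the sketch below (illustrative only: the attached file was produced by arrow2, not by this code, and the file names are placeholders):
> {code:python}
> import pyarrow
> import pyarrow.parquet
> 
> # Read the integration IPC file and write it back out as Parquet v1.
> # use_dictionary=True (the default) asks the writer to emit a dictionary
> # page with the values followed by a data page with the indices.
> table = pyarrow.ipc.RecordBatchFileReader("generated_dictionary.arrow_file").read_all()
> pyarrow.parquet.write_table(
>     table, "generated_dictionary.parquet", version="1.0", use_dictionary=True
> )
> {code}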
> The expected result for column 0 ("dict0") is:
> {code:python}
> import pyarrow
> import pyarrow.ipc
> import pyspark.sql
> path = "generated_dictionary"
> golden_path = f"../testing/arrow-testing/data/arrow-ipc-stream/integration/1.0.0-littleendian/{path}.arrow_file"
> column = ("dict0", 0)
> table = pyarrow.ipc.RecordBatchFileReader(golden_path).read_all()
> expected = next(c for i, c in enumerate(table.itercolumns()) if i == column[1])
> expected = expected.combine_chunks().tolist()
> print(expected)
> # ['nwg€6d€', None, None, None, None, None, None, None, None, 'e£a5µ矢a', None, None, 'rpc£µ£3', None, None, None, None]
> # read with pyspark
> spark = pyspark.sql.SparkSession.builder.config(
>     # see https://stackoverflow.com/a/62024670/931303
>     "spark.sql.parquet.enableVectorizedReader",
>     "false",
> ).getOrCreate()
> df = spark.read.parquet(f"{path}.parquet")
> r = df.select(column[0]).collect()
> result = [row[column[0]] for row in r]
> assert expected == result
> {code}
> However, I have been unable to read it correctly with pyarrow. The result I get is:
> {code:python}
> import pyarrow.parquet as pq
> 
> table = pq.read_table(f"{path}.parquet")
> result = table[0]
> print(result.combine_chunks().dictionary)
> print(result.combine_chunks().indices)
> # output of the first print (the dictionary):
> [
>   "2lf4µµr",
>   "",
>   "nwg€6d€",
>   "rpc£µ£3",
>   "e£a5µ矢a"
> ]
> # output of the second print (the indices):
> [
>   2,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   8,
>   null,
>   null,
>   4,
>   null,
>   null,
>   null,
>   null
> ]
> {code}
> which is incorrect: the largest index (8) is out of bounds for a dictionary of length 5.
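> A quick way to assert that invariant on the column read back by pyarrow (using the `result` variable from the snippet above):
> {code:python}
> combined = result.combine_chunks()
> indices = [i for i in combined.indices.to_pylist() if i is not None]
> # every index must point at an existing dictionary entry
> assert max(indices) < len(combined.dictionary), (max(indices), len(combined.dictionary))
> {code}
> With the attached file this assertion fails, since index 8 cannot address a 5-entry dictionary.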
> The indices are being read correctly, but not all dictionary values are. For clarity, the buffer in the dictionary page (PLAIN-encoded, as per the spec) of the attached parquet file is:
> {code:python}
> # ["2lf4µµr", "", "nwg€6d€", "", "rpc£µ£3", "", "", "", "e£a5µ矢a", ""]
> [
> 9, 0, 0, 0, 50, 108, 102, 52, 194, 181, 194, 181, 114,              # len 9  + "2lf4µµr"
> 0, 0, 0, 0,                                                          # len 0  + ""
> 11, 0, 0, 0, 110, 119, 103, 226, 130, 172, 54, 100, 226, 130, 172,  # len 11 + "nwg€6d€"
> 0, 0, 0, 0,                                                          # len 0  + ""
> 10, 0, 0, 0, 114, 112, 99, 194, 163, 194, 181, 194, 163, 51,        # len 10 + "rpc£µ£3"
> 0, 0, 0, 0,                                                          # len 0  + ""
> 0, 0, 0, 0,                                                          # len 0  + ""
> 0, 0, 0, 0,                                                          # len 0  + ""
> 11, 0, 0, 0, 101, 194, 163, 97, 53, 194, 181, 231, 159, 162, 97,    # len 11 + "e£a5µ矢a"
> 0, 0, 0, 0                                                           # len 0  + ""
> ]
> {code}
> and the reported number of values in the dictionary page header is 10. I would expect all 10 values to be read directly into the dictionary.
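> For completeness, PLAIN-encoded BYTE_ARRAY values are laid out as a 4-byte little-endian length prefix followed by that many bytes, so the buffer above can be decoded with a few lines (a sketch, not arrow code):
> {code:python}
> import struct
> 
> def decode_plain_byte_array(buf: bytes) -> list:
>     """Decode a PLAIN-encoded BYTE_ARRAY buffer: each value is a 4-byte
>     little-endian length prefix followed by that many (UTF-8) bytes."""
>     values, offset = [], 0
>     while offset < len(buf):
>         (length,) = struct.unpack_from("<i", buf, offset)
>         offset += 4
>         values.append(buf[offset:offset + length].decode("utf-8"))
>         offset += length
>     return values
> 
> # decoding the buffer above yields all 10 strings listed in the comment,
> # matching the num_values reported in the dictionary page header
> {code}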
> We cannot rule out the possibility that I am doing something wrong on the write side. So far I have been able to round-trip these files within arrow2, and arrow2 can read dictionary-encoded files written by both pyarrow and pyspark, which suggests that the arrow2 reader is correct.
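> One way to cross-check the file itself, independently of either reader, is to inspect the column-chunk metadata with pyarrow (a sketch; it only shows that a dictionary page is present and which encodings are declared, not the dictionary page's own value count):
> {code:python}
> import pyarrow.parquet as pq
> 
> meta = pq.ParquetFile("generated_dictionary.parquet").metadata
> col = meta.row_group(0).column(0)
> print(col.has_dictionary_page, col.dictionary_page_offset, col.encodings)
> {code}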



--
This message was sent by Atlassian Jira
(v8.3.4#803005)