Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/31 10:20:09 UTC
[GitHub] [arrow-rs] alamb edited a comment on issue #641: Incorrect min/max statistics for strings in parquet files
URL: https://github.com/apache/arrow-rs/issues/641#issuecomment-890325580
I have confirmed that the Python parquet writer correctly stores `"tewsbury"` as the max in the statistics, using this Python script:
```python
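import pyarrow  # not used directly; pandas uses it as the parquet engine
import pandas as pd

# Six city names; "tewsbury" sorts last, so it should be the max statistic
data = [
    "andover",
    "reading",
    "bedford",
    "tewsbury",
    "lexington",
    "lawrence",
]

# Build a single string column named 'city' and write it out as parquet
df = pd.DataFrame(data, columns=['city'])
df.to_parquet('/tmp/test_python.parquet')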
```
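For reference, the min/max I would expect for this column can be worked out without any parquet tooling. The sketch below assumes that string statistics are ordered by unsigned byte-wise comparison of the UTF-8 encoded values; the helper name `expected_min_max` is made up for illustration and is not part of pyarrow or arrow-rs.

```python
cities = [
    "andover",
    "reading",
    "bedford",
    "tewsbury",
    "lexington",
    "lawrence",
]


def expected_min_max(values):
    """Return (min, max) under byte-wise ordering of the UTF-8 encodings.

    Hypothetical helper: it only mirrors the ordering assumed above.
    """
    # Python compares bytes objects byte by byte, treating each byte as unsigned
    encoded = [value.encode("utf-8") for value in values]
    return min(encoded).decode("utf-8"), max(encoded).decode("utf-8")


expected = expected_min_max(cities)
print(expected)
```

For plain ASCII input like this, the result is the same as calling `min(cities)` and `max(cities)` directly; the explicit encoding step only matters once non-ASCII strings are involved.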
`parquet-tools` then confirms that the min/max are "andover"/"tewsbury", as expected:
```shell
alamb@ip-192-168-0-133 /tmp % parquet-tools dump /tmp/test_python.parquet
parquet-tools dump /tmp/test_python.parquet
row group 0
----------------------------------------------------------------------------------------------------------------------
city: BINARY SNAPPY DO:4 FPO:90 SZ:139/137/0.99 VC:6 ENC:RLE,PLAIN,PLAIN_DICTIONARY ST:[min: andover, max: [more]...
city TV=6 RL=0 DL=1 DS: 6 DE:PLAIN_DICTIONARY
------------------------------------------------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY ST:[min: andover, max: tewsbury, num_nulls: 0] [more]... VC:6
BINARY city
----------------------------------------------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 6 ***
value 1: R:0 D:1 V:andover
value 2: R:0 D:1 V:reading
value 3: R:0 D:1 V:bedford
value 4: R:0 D:1 V:tewsbury
value 5: R:0 D:1 V:lexington
value 6: R:0 D:1 V:lawrence
alamb@ip-192-168-0-133 /tmp %
```