Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/31 10:20:09 UTC

[GitHub] [arrow-rs] alamb edited a comment on issue #641: Incorrect min/max statistics for strings in parquet files

alamb edited a comment on issue #641:
URL: https://github.com/apache/arrow-rs/issues/641#issuecomment-890325580


   I have confirmed that the Python parquet writer correctly stores `"tewsbury"` as the max in statistics.
   
   Using this python script:
   ```python
   import pyarrow
   import pandas as pd
   
   data = [
       "andover",
       "reading",
       "bedford",
       "tewsbury",
       "lexington",
       "lawrence",
   ]
   
   df = pd.DataFrame(data, columns=['city'])
   df.to_parquet('/tmp/test_python.parquet')
   ```
   
   `parquet-tools` then confirms that the min/max are "andover"/"tewsbury" as expected:
   
   ```shell
   alamb@ip-192-168-0-133 /tmp % parquet-tools dump /tmp/test_python.parquet 
   parquet-tools dump /tmp/test_python.parquet 
   row group 0 
   ----------------------------------------------------------------------------------------------------------------------
   city:  BINARY SNAPPY DO:4 FPO:90 SZ:139/137/0.99 VC:6 ENC:RLE,PLAIN,PLAIN_DICTIONARY ST:[min: andover, max:  [more]...
   
       city TV=6 RL=0 DL=1 DS: 6 DE:PLAIN_DICTIONARY
       ------------------------------------------------------------------------------------------------------------------
       page 0:                  DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY ST:[min: andover, max: tewsbury, num_nulls: 0] [more]... VC:6
   
   BINARY city 
   ----------------------------------------------------------------------------------------------------------------------
   *** row group 1 of 1, values 1 to 6 *** 
   value 1: R:0 D:1 V:andover
   value 2: R:0 D:1 V:reading
   value 3: R:0 D:1 V:bedford
   value 4: R:0 D:1 V:tewsbury
   value 5: R:0 D:1 V:lexington
   value 6: R:0 D:1 V:lawrence
   alamb@ip-192-168-0-133 /tmp % 
   ```
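
As a cross-check that does not depend on any Parquet writer, the expected min/max can be computed directly from the input list. This is only a sketch: it assumes string statistics are ordered by unsigned byte-wise comparison of the UTF-8 encoding, and `expected_min_max` is a hypothetical helper name, not part of pyarrow or parquet-tools.

```python
# Expected statistics for the 'city' column, computed independently of any
# Parquet writer. Assumption: string min/max follow unsigned byte-wise
# ordering of the UTF-8 encoded values.
data = [
    "andover",
    "reading",
    "bedford",
    "tewsbury",
    "lexington",
    "lawrence",
]


def expected_min_max(values):
    """Return (min, max) of the strings, ordered by their UTF-8 bytes."""
    encoded = [v.encode("utf-8") for v in values]
    # Python compares bytes objects lexicographically as unsigned bytes.
    return min(encoded).decode("utf-8"), max(encoded).decode("utf-8")


print(expected_min_max(data))
```

For this input the result should agree with the `min`/`max` values that `parquet-tools` reports for page 0 above.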
   

