Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/15 20:27:13 UTC

[GitHub] [arrow-datafusion] alamb commented on issue #1433: Query failing to return any results when filter is an equality check on strings

alamb commented on issue #1433:
URL: https://github.com/apache/arrow-datafusion/issues/1433#issuecomment-995187994


   This sounds similar to something we hit in IOx (https://github.com/influxdata/influxdb_iox/issues/2153) which I ultimately tracked down to a bug in the parquet statistics generation: https://github.com/apache/arrow-rs/issues/641
   
   So in this case, the statistics embedded in the parquet file for the `direction` column are `ST:[min: Merged, max: Outgoing, num_nulls not defined]`: the minimum value is recorded as `"Merged"` and the maximum as `"Outgoing"`, which I do not think is correct.
   
   ```shell
   $ parquet-tools meta test.parquet 
   file:        file:/Users/alamb/Downloads/test.parquet 
   creator:     UrbanLogiq 
   extra:       ARROW:schema = /////+gAAAAQAAAAAAAKAA4ADAALAAQACgAAABQAAAAAAAABBAAKAAwAAAAIAAQACgAAAAgAAAAIAAAAAAAAAAMAAAB8AAAAPAAAAAQAAACg////GAAAACAAAAAAAAACHAAAAAgADAAEAAsACAAAACAAAAAAAAABAAAAAAMAAABhZHQA1P///xQAAAAMAAAAAAAABQwAAAAAAAAAxP///wkAAABkaXJlY3Rpb24AAAAQABQAEAAAAA8ABAAAAAgAEAAAABgAAAAMAAAAAAAABRAAAAAAAAAABAAEAAQAAAAKAAAAdWxfbm9kZV9pZAAA 
   
   file schema: arrow_schema 
   --------------------------------------------------------------------------------
   ul_node_id:  REQUIRED BINARY L:STRING R:0 D:0
   direction:   REQUIRED BINARY L:STRING R:0 D:0
   adt:         REQUIRED INT32 R:0 D:0
   
   row group 1: RC:301 TS:3384 OFFSET:4 
   --------------------------------------------------------------------------------
   ul_node_id:   BINARY ZSTD DO:4 FPO:1796 SZ:2143/3187/1.49 VC:301 ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: /ehIvdei+UGfkQ4Gy5fr1w==, max: zThqpswvY6fa3VHF4BKWfw==, num_nulls not defined]
   direction:    BINARY ZSTD DO:2243 FPO:2311 SZ:195/177/0.91 VC:301 ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: Merged, max: Outgoing, num_nulls not defined]
   adt:          INT32 ZSTD DO:2500 FPO:3159 SZ:1046/1503/1.44 VC:301 ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: 15, max: 23116, num_nulls not defined]
   ```
   
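   These statistics appear to be incorrect for the data in test.parquet, which also contains `Incoming` (sorts before `Merged`) and `Two Way` (sorts after `Outgoing`):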
   
   ```
   ❯ select distinct direction from t order by direction;
   +-----------+
   | direction |
   +-----------+
   | Incoming  |
   | Merged    |
   | Outgoing  |
   | Two Way   |
   +-----------+
   ```
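
To make the mismatch concrete, here is a minimal sketch of how min/max statistics could be used to skip a row group for an equality filter. It is a simplified model under my own assumptions (byte-wise string ordering, a hypothetical `may_contain` helper), not DataFusion's or arrow-rs's actual pruning code:

```python
# Simplified model of min/max row-group pruning; NOT DataFusion's actual code.
# `may_contain` is a hypothetical helper introduced only for illustration.

# Distinct values actually present in the `direction` column (from the query above).
values = ["Incoming", "Merged", "Outgoing", "Two Way"]

# Statistics as reported by parquet-tools for the `direction` column.
stored_min, stored_max = "Merged", "Outgoing"

# What the statistics would be under byte-wise (lexicographic) ordering.
true_min = min(values, key=lambda s: s.encode("utf-8"))
true_max = max(values, key=lambda s: s.encode("utf-8"))


def may_contain(literal: str, lo: str, hi: str) -> bool:
    """Return True if a row group with value range [lo, hi] could hold `literal`."""
    b = literal.encode("utf-8")
    return lo.encode("utf-8") <= b <= hi.encode("utf-8")


# Values for which `direction = <value>` would wrongly skip the row group.
pruned = [v for v in values if not may_contain(v, stored_min, stored_max)]

print("expected stats:", true_min, true_max)
print("values pruned by stored stats:", pruned)
```

Under these assumptions, a filter such as `direction = 'Incoming'` falls outside the stored range, so a reader that trusts the statistics would skip the row group and return no rows, even though the value is present.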

