You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/13 12:23:17 UTC

[GitHub] [arrow-datafusion] franeklubi opened a new issue #1441: Incorrect results in datafusion-cli

franeklubi opened a new issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441


   **Describe the bug**
   I came upon a bug while querying my custom Parquet dataset, which causes DataFusion to produce incoherent and incorrect results.
   
   I tested my dataset in various ways, all of which produced the desired results:
   - reading parquet files using python pandas, then merging and filtering the data there
   - encoding into CSV, and reading the data with DataFusion
   - creating an SQLite database using the provided CSV files, and using the same queries there
   
   **To Reproduce**
   Steps to reproduce the behavior:
   1. Download all the code and data I used for testing:
   
   [issue_data.zip](https://github.com/apache/arrow-datafusion/files/7703604/issue_data.zip)
   
   Inside there are the Parquet files and CSVs with exactly the same data (also, there's an sqlite database created from the provided CSV files).
   
   2. Use the instructions included in `README.md` to reproduce the issue:
   
   The query, that fails when querying Parquet files with datafusion-cli:
   ```sql
   -- 1. Distinct stop names
   SELECT DISTINCT stop_name FROM stop INNER JOIN trip ON tid = trip_tid WHERE line = '176' ORDER BY stop_name NULLS LAST;
   ```
   Change only in `where` from `line` to `trip_line` produces the desired results.
   
   **Expected behavior**
   Should produce these 27 rows:
   ```
   Bartnicza
   Bazyliańska
   Bolesławicka
   Brzezińska
   Budowlana
   Choszczówka
   Chłodnia
   Daniszewska
   Fabryka Pomp
   Insurekcji
   Marcelin
   Marywilska-Las
   Ołówkowa
   PKP Płudy
   PKP Żerań
   Parowozowa
   Pelcowizna
   Polnych Kwiatów
   Raciborska
   Rembielińska
   Sadkowska
   Smugowa
   Starego Dębu
   Zyndrama z Maszkowic
   os.Marywilska
   Śpiewaków
   None
   ```
   
   Query 1 from `README.md` (mentioned above) produces this incorrect set of 33 rows:
   ```
   +----------------------+
   | stop_name            |
   +----------------------+
   | Bartnicza            |
   | Bazyliańska          |
   | Bolesławicka         |
   | Brzezińska           |
   | Budowlana            |
   | Choszczówka          |
   | Chłodnia             |
   | Cygańska             |
   | Czołgistów           |
   | Daniszewska          |
   | Fabryka Pomp         |
   | Insurekcji           |
   | Majerankowa          |
   | Marcelin             |
   | Marywilska-Las       |
   | Ołówkowa             |
   | PKP Falenica         |
   | PKP Płudy            |
   | PKP Żerań            |
   | Parowozowa           |
   | Pelcowizna           |
   | Polnych Kwiatów      |
   | Raciborska           |
   | Rembielińska         |
   | Rokosowska           |
   | Sadkowska            |
   | Smugowa              |
   | Starego Dębu         |
   | Zbójna Góra          |
   | Zyndrama z Maszkowic |
   | os.Marywilska        |
   | Śpiewaków            |
   |                      |
   +----------------------+
   ```
   
   **Additional context**
   Datafusion version:
   ```sh
   $ datafusion-cli --version
   DataFusion 5.1.0
   ```
   
   **My guess**
   Since the Parquet files have encoded NULLs, and reading the CSV files with `datafusion-cli` gets rid of those, my best bet is on the usage of NULLs and some weir behavior when joining.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb closed issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb closed issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

tustvold edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1000499250


   You could try using ComplexObjectArrayReader to decode the column instead of ArrowArrayReader. This might help narrow down if the bug lies in the RLE decoding or something higher up in ArrowArrayReader. Alternatively you could see if the issue occurs with https://github.com/apache/arrow-rs/pull/1082  which replaces ArrowArrayReader with an alternative that shares more with the PrimitiveArrayReader/ComplexObjectArrayReader implementations, again this might help narrow down the origin.
   
   Not at a computer at the moment, otherwise would try myself


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] franeklubi commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

franeklubi commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992599002


   I've updated the `README.md` of `issue_data.zip`, to account for the header rows in csv files.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

Jimexist edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992607329


   thanks for the update.
   
   to people reading this, i'm still trying to minimize the reproduction steps, so i guess below is a simpler statement:
   
   ```sql
   CREATE EXTERNAL TABLE stop_parquet STORED AS PARQUET LOCATION './parquets/stops';
   ```
   
   ```sql
   CREATE EXTERNAL TABLE stop_csv (time TEXT, trip_tid TEXT, trip_line TEXT, stop_name TEXT) STORED AS CSV LOCATION './csvs/stop.csv';
   ```
   
   ```
   ❯ select distinct stop_name from stop_csv;
   +------------------------------+
   | stop_name                    |
   +------------------------------+
   | Szczęśliwice                 |
   | Wawelska                     |
   | Bolesławicka                 |
   ...
   | Ceramiczna                   |
   | Czołgistów                   |
   +------------------------------+
   134 rows in set. Query took 0.015 seconds.
   ```
   
   ```
   ❯ select distinct stop_name from stop_parquet;
   +------------------------------+
   | stop_name                    |
   +------------------------------+
   | Milenijna                    |
   | Park Praski                  |
   | Wolności                     |
   ...
   | Dzika                        |
   | Budowlana                    |
   | PKP Płudy                    |
   | Bolesławicka                 |
   | Marcelin                     |
   +------------------------------+
   112 rows in set. Query took 0.008 seconds.
   ```
   
   i.e. even without join, we can tell that parquet and csv reads differently.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

Jimexist edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992620238


   seems like these values are absent in parquet readings:
   
   ```
   ❯ select stop_name from stop_csv except select stop_name from stop_parquet order by stop_name;
   +------------------------+
   | stop_name              |
   +------------------------+
   |                        |
   | CH Promenada           |
   | Centrum                |
   | Ciołkosza              |
   | Kanał Gocławski        |
   | Marszałkowska          |
   | Marysin                |
   | Metro Politechnika     |
   | Metro Świętokrzyska    |
   | Odkryta                |
   | Okularowa              |
   | Poligonowa             |
   | Przyczółek Grochowski  |
   | Rezedowa               |
   | Rozbrat                |
   | Saska                  |
   | Zajezdnia Ostrobramska |
   | Zamieniecka            |
   | pl.Bankowy             |
   | pl.Konstytucji         |
   | pl.Na Rozdrożu         |
   | stop_name              |
   | Łysakowska             |
   +------------------------+
   23 rows in set. Query took 0.019 seconds.
   ```
   
   and the scanning process was invalid:
   
   ```
   ❯ select distinct stop_name from stop_parquet where stop_name = 'Odkryta';
   +-----------+
   | stop_name |
   +-----------+
   | Odkryta   |
   +-----------+
   1 row in set. Query took 0.007 seconds.
   ```
   
   notice the `Odkryta` row would be missing in the statement below.
   
   ```
   ❯ select distinct stop_name from stop_parquet;
   +------------------------------+
   | stop_name                    |
   +------------------------------+
   |                              |
   | Bystra                       |
   | Dobosza                      |
   | Urbanistów                   |
   | Pelcowizna                   |
   | PUSTELNIK                    |
   | Strzeleckiego                |
   | Bełdan                       |
   | Kijowska                     |
   | Świątynia Opatrzności Bożej  |
   | Białołęka-Ratusz             |
   | Armatnia                     |
   | Muranowska                   |
   | Rokosowska                   |
   | Choszczówka                  |
   | os.Marywilska                |
   | Metro Stokłosy               |
   | Leśnej Polanki               |
   | Ćwiklińskiej                 |
   | Nowodwory                    |
   | Stawki                       |
   | PKP Falenica                 |
   | Bazyliańska                  |
   | Bartnicza                    |
   | ARMATNIA                     |
   | Metro Ratusz-Arsenał         |
   | Ćmielowska                   |
   | Myśliborska                  |
   | Oś Królewska                 |
   | RONDO ZESŁAŃCÓW SYBERYJSKICH |
   | pl.Inwalidów                 |
   | Opaczewska                   |
   | Rudzka                       |
   | Nowolipie                    |
   | Polnych Kwiatów              |
   | PKP Żerań                    |
   | Sarmacka                     |
   | rondo Zesłańców Syberyjskich |
   | Leszno                       |
   | Gorzykowska                  |
   | Miła                         |
   | pl.Narutowicza               |
   | pl.Zawiszy                   |
   | Majerankowa                  |
   | Daniszewska                  |
   | PKP Olszynka Grochowska      |
   | Stare Miasto                 |
   | Cm.Wolski                    |
   | Metro Stadion Narodowy       |
   | Małych Dębów                 |
   | Marywilska-Las               |
   | Powsinek                     |
   | Branickiego                  |
   | Vogla                        |
   | Olesin                       |
   | Ceramiczna                   |
   | CH Marki                     |
   | Dw.Wileński                  |
   | Hala Kopińska                |
   | Inflancka                    |
   | Klaudyny                     |
   | Chłodna                      |
   | Zbójna Góra                  |
   | Czołgistów                   |
   | Sadkowska                    |
   | Insurekcji                   |
   | Brzezińska                   |
   | Ołówkowa                     |
   | Smugowa                      |
   | Dąbrówka Wiślana             |
   | Żerań FSO                    |
   | Kino Femina                  |
   | gen.Zajączka                 |
   | Wola-Ratusz                  |
   | Bohomolca                    |
   | Chłodnia                     |
   | Starego Dębu                 |
   | Raciborska                   |
   | Rembielińska                 |
   | Wałbrzyska-Cmentarz          |
   | EC Żerań                     |
   | os.Potok                     |
   | Metro Księcia Janusza        |
   | Dw.Gdański                   |
   | Mennica                      |
   | Norblin                      |
   | Sienna                       |
   | Gwiaździsta                  |
   | Sobocka                      |
   | Marymont-Potok               |
   | Śpiewaków                    |
   | Fabryka Pomp                 |
   | METRO RATUSZ-ARSENAŁ         |
   | Osiedle                      |
   | Szczęśliwice                 |
   | Szwedzka                     |
   | pl.Starynkiewicza            |
   | Wawelska                     |
   | Dzika                        |
   | Budowlana                    |
   | PKP Płudy                    |
   | Bolesławicka                 |
   | Marcelin                     |
   | Milenijna                    |
   | Park Praski                  |
   | Wolności                     |
   | Smocza                       |
   | pl.Wilsona                   |
   | Cygańska                     |
   | Parowozowa                   |
   | Zyndrama z Maszkowic         |
   | Parafialna                   |
   +------------------------------+
   112 rows in set. Query took 0.008 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

Jimexist edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992620238


   seems like these values are absent in parquet readings:
   
   ```
   ❯ select stop_name from stop_csv except select stop_name from stop_parquet order by stop_name;
   +------------------------+
   | stop_name              |
   +------------------------+
   |                        |
   | CH Promenada           |
   | Centrum                |
   | Ciołkosza              |
   | Kanał Gocławski        |
   | Marszałkowska          |
   | Marysin                |
   | Metro Politechnika     |
   | Metro Świętokrzyska    |
   | Odkryta                |
   | Okularowa              |
   | Poligonowa             |
   | Przyczółek Grochowski  |
   | Rezedowa               |
   | Rozbrat                |
   | Saska                  |
   | Zajezdnia Ostrobramska |
   | Zamieniecka            |
   | pl.Bankowy             |
   | pl.Konstytucji         |
   | pl.Na Rozdrożu         |
   | stop_name              |
   | Łysakowska             |
   +------------------------+
   23 rows in set. Query took 0.019 seconds.
   ```
   
   and the scanning process was invalid:
   
   ```
   ❯ select distinct stop_name from stop_parquet where stop_name = 'Odkryta';
   +-----------+
   | stop_name |
   +-----------+
   | Odkryta   |
   +-----------+
   1 row in set. Query took 0.007 seconds.
   ```
   
   ```
   ❯ select distinct stop_name from stop_parquet;
   +------------------------------+
   | stop_name                    |
   +------------------------------+
   |                              |
   | Bystra                       |
   | Dobosza                      |
   | Urbanistów                   |
   | Pelcowizna                   |
   | PUSTELNIK                    |
   | Strzeleckiego                |
   | Bełdan                       |
   | Kijowska                     |
   | Świątynia Opatrzności Bożej  |
   | Białołęka-Ratusz             |
   | Armatnia                     |
   | Muranowska                   |
   | Rokosowska                   |
   | Choszczówka                  |
   | os.Marywilska                |
   | Metro Stokłosy               |
   | Leśnej Polanki               |
   | Ćwiklińskiej                 |
   | Nowodwory                    |
   | Stawki                       |
   | PKP Falenica                 |
   | Bazyliańska                  |
   | Bartnicza                    |
   | ARMATNIA                     |
   | Metro Ratusz-Arsenał         |
   | Ćmielowska                   |
   | Myśliborska                  |
   | Oś Królewska                 |
   | RONDO ZESŁAŃCÓW SYBERYJSKICH |
   | pl.Inwalidów                 |
   | Opaczewska                   |
   | Rudzka                       |
   | Nowolipie                    |
   | Polnych Kwiatów              |
   | PKP Żerań                    |
   | Sarmacka                     |
   | rondo Zesłańców Syberyjskich |
   | Leszno                       |
   | Gorzykowska                  |
   | Miła                         |
   | pl.Narutowicza               |
   | pl.Zawiszy                   |
   | Majerankowa                  |
   | Daniszewska                  |
   | PKP Olszynka Grochowska      |
   | Stare Miasto                 |
   | Cm.Wolski                    |
   | Metro Stadion Narodowy       |
   | Małych Dębów                 |
   | Marywilska-Las               |
   | Powsinek                     |
   | Branickiego                  |
   | Vogla                        |
   | Olesin                       |
   | Ceramiczna                   |
   | CH Marki                     |
   | Dw.Wileński                  |
   | Hala Kopińska                |
   | Inflancka                    |
   | Klaudyny                     |
   | Chłodna                      |
   | Zbójna Góra                  |
   | Czołgistów                   |
   | Sadkowska                    |
   | Insurekcji                   |
   | Brzezińska                   |
   | Ołówkowa                     |
   | Smugowa                      |
   | Dąbrówka Wiślana             |
   | Żerań FSO                    |
   | Kino Femina                  |
   | gen.Zajączka                 |
   | Wola-Ratusz                  |
   | Bohomolca                    |
   | Chłodnia                     |
   | Starego Dębu                 |
   | Raciborska                   |
   | Rembielińska                 |
   | Wałbrzyska-Cmentarz          |
   | EC Żerań                     |
   | os.Potok                     |
   | Metro Księcia Janusza        |
   | Dw.Gdański                   |
   | Mennica                      |
   | Norblin                      |
   | Sienna                       |
   | Gwiaździsta                  |
   | Sobocka                      |
   | Marymont-Potok               |
   | Śpiewaków                    |
   | Fabryka Pomp                 |
   | METRO RATUSZ-ARSENAŁ         |
   | Osiedle                      |
   | Szczęśliwice                 |
   | Szwedzka                     |
   | pl.Starynkiewicza            |
   | Wawelska                     |
   | Dzika                        |
   | Budowlana                    |
   | PKP Płudy                    |
   | Bolesławicka                 |
   | Marcelin                     |
   | Milenijna                    |
   | Park Praski                  |
   | Wolności                     |
   | Smocza                       |
   | pl.Wilsona                   |
   | Cygańska                     |
   | Parowozowa                   |
   | Zyndrama z Maszkowic         |
   | Parafialna                   |
   +------------------------------+
   112 rows in set. Query took 0.008 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1000460723


   This discrepancy looks like it comes out of `VariableLenDictionaryDecoder` which seems to have been introduced by @yordan-pavlov  in  https://github.com/apache/arrow-rs/pull/384
   
   I have not studied the code enough yet to fully understand what it is doing, and I need to attend to some other items now. If anyone has ideas (cc @tustvold ) on where to look next I would appreciate it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-999872497


   I plan to focus on this issue tomorrow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-996265753


   I spent some time looking at the parquet data and the csv data, and it looks to me like there may be something wrong with the parquet reader. 
   
   Specifically, I just ran a query that did a select *
   
   dump_parquet.sql:
   ```sql
   CREATE EXTERNAL TABLE stops_parquet
   STORED AS PARQUET
   LOCATION '/Users/alamb/Documents/Wrong-answer-datafusion-1141/issue_data/parquets/stops';
   
   show columns from stops_parquet;
   
   CREATE EXTERNAL TABLE trips_parquet
   STORED AS PARQUET
   LOCATION '/Users/alamb/Documents/Wrong-answer-datafusion-1141/issue_data/parquets/trips';
   
   show columns from trips_parquet;
   
   select * from stops_parquet order by time, trip_tid, trip_line, stop_name;
   select * from trips_parquet order by tid, line, base_day;
   ```
   
   and `dump_csv.sql`:
   
   ```sql
   CREATE EXTERNAL TABLE stops_csv (time timestamp, trip_tid bigint, trip_line TEXT, stop_name TEXT)
   STORED AS CSV WITH HEADER ROW
   LOCATION '/Users/alamb/Documents/Wrong-answer-datafusion-1141/issue_data/csvs/stop.csv';
   
   show columns from stops_csv;
   
   CREATE EXTERNAL TABLE trips_csv (tid bigint, line TEXT, base_day date)
   STORED AS CSV WITH HEADER ROW
   LOCATION '/Users/alamb/Documents/Wrong-answer-datafusion-1141/issue_data/csvs/trip.csv';
   
   show columns from trips_csv;
   
   select * from stops_csv order by time, trip_tid, trip_line, stop_name;
   select * from trips_csv order by tid, line, base_day;
   ```
   
   Like this:
   
   ```shell
   ~/Software/arrow-datafusion/target/debug/datafusion-cli -f dump_csv.sql  > dump_csv.txt
   ~/Software/arrow-datafusion/target/debug/datafusion-cli -f dump_parquet.sql  > dump_parquet.txt
   ```
   
   The results are here: 
   [dump_csv.txt](https://github.com/apache/arrow-datafusion/files/7730874/dump_csv.txt)
   [dump_parquet.txt](https://github.com/apache/arrow-datafusion/files/7730875/dump_parquet.txt)
   
   And a quick visual diff shows they aren't the same
   
   The first few lines of `dump_csv.txt` look like
   ```
   +---------------------+----------+-----------+------------------------------+
   | time                | trip_tid | trip_line | stop_name                    |
   +---------------------+----------+-----------+------------------------------+
   | 2021-11-15 05:00:00 | 54761677 | N64       |                              |
   | 2021-11-15 05:00:00 | 54778942 | 204       |                              |
   | 2021-11-15 05:00:00 | 54788307 | 186       | RONDO ZESŁAŃCÓW SYBERYJSKICH |
   | 2021-11-15 05:00:00 | 54788967 | N41       |                              |
   | 2021-11-15 05:00:00 | 54788988 | N41       |                              |
   | 2021-11-15 05:00:00 | 54802937 | 104       |                              |
   ```
   
   While the first few lines of `dump_parquet.txt` l look like:
   
   ```
   +---------------------+----------+-----------+------------------------------+
   | time                | trip_tid | trip_line | stop_name                    |
   +---------------------+----------+-----------+------------------------------+
   | 2021-11-15 00:00:00 | 54761677 | N64       |                              |
   | 2021-11-15 00:00:00 | 54778942 | 204       |                              |
   | 2021-11-15 00:00:00 | 54788307 | 186       | Armatnia                     |
   | 2021-11-15 00:00:00 | 54788967 | N41       |                              |
   | 2021-11-15 00:00:00 | 54788988 | N41       |                              |
   | 2021-11-15 00:00:00 | 54802937 | 104       |                              |
   ```
   
   (note that the stop name is different)
   
   However, when I look for that mismatched line `trip_tid=54788307` in the data using pandas it does match with the csv:
   
   Here is the raw data in csv:
   ```shell
   $ grep 54788307 issue_data/csvs/stop.csv
   2021-11-15 00:00:00,54788307,186,RONDO ZESŁAŃCÓW SYBERYJSKICH
   
   ```
   
   Here is what comes out when using pandas:
   
   ```python
   Python 3.8.12 (default, Oct 13 2021, 06:42:42) 
   [Clang 13.0.0 (clang-1300.0.29.3)] on darwin
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import pandas as pd
   import pandas as pd
   >>> df = pd.read_parquet('issue_data/parquets/stops/2021-11.parquet')
   df = pd.read_parquet('issue_data/parquets/stops/2021-11.parquet')
   >>> df.to_csv('/tmp/2021-11.csv')
   df.to_csv('/tmp/2021-11.csv')
   >>> 
   ```
   
   ```shell
   $ grep 54788307 /tmp/2021-11.csv 
   16591,2021-11-15 00:00:00,54788307.0,186,RONDO ZESŁAŃCÓW SYBERYJSKICH
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1038174703


   I reran the queries from @franeklubi at 2e918184c502a40a041fb0163702cb6ab8de0af9:
   
   # Setup
   ```sql
   CREATE EXTERNAL TABLE stop (time TEXT, trip_tid TEXT, trip_line TEXT, stop_name TEXT) STORED AS CSV WITH HEADER ROW LOCATION '/Users/alamb/Downloads/issue_data/csvs/stop.csv';
   CREATE EXTERNAL TABLE trip (tid TEXT, line TEXT, base_day TEXT) STORED AS CSV WITH HEADER ROW LOCATION '/Users/alamb/Downloads/issue_data/csvs/trip.csv';
   
   CREATE EXTERNAL TABLE stop_parquet STORED AS PARQUET LOCATION '/Users/alamb/Downloads/issue_data/parquets/stops/';
   CREATE EXTERNAL TABLE trip_parquet STORED AS PARQUET LOCATION '/Users/alamb/Downloads/issue_data/parquets/trips/';
   ```
   
   Now the results are consistent when using csv and parquet:
   
   ## Parquet
   ```sql
   ❯ SELECT DISTINCT stop_name FROM stop_parquet INNER JOIN trip_parquet ON tid = trip_tid WHERE line = '176' ORDER BY stop_name NULLS LAST;
   +----------------------+
   | stop_name            |
   +----------------------+
   | Bartnicza            |
   | Bazyliańska          |
   | Bolesławicka         |
   | Brzezińska           |
   | Budowlana            |
   | Choszczówka          |
   | Chłodnia             |
   | Daniszewska          |
   | Fabryka Pomp         |
   | Insurekcji           |
   | Marcelin             |
   | Marywilska-Las       |
   | Ołówkowa             |
   | PKP Płudy            |
   | PKP Żerań            |
   | Parowozowa           |
   | Pelcowizna           |
   | Polnych Kwiatów      |
   | Raciborska           |
   | Rembielińska         |
   | Sadkowska            |
   | Smugowa              |
   | Starego Dębu         |
   | Zyndrama z Maszkowic |
   | os.Marywilska        |
   | Śpiewaków            |
   |                      |
   +----------------------+
   27 rows in set. Query took 0.042 seconds.
   ```
   
   ## Csv
   
   ```sql
   ❯ SELECT DISTINCT stop_name FROM stop INNER JOIN trip ON tid = trip_tid WHERE line = '176' ORDER BY stop_name NULLS LAST;
   +----------------------+
   | stop_name            |
   +----------------------+
   |                      |
   | Bartnicza            |
   | Bazyliańska          |
   | Bolesławicka         |
   | Brzezińska           |
   | Budowlana            |
   | Choszczówka          |
   | Chłodnia             |
   | Daniszewska          |
   | Fabryka Pomp         |
   | Insurekcji           |
   | Marcelin             |
   | Marywilska-Las       |
   | Ołówkowa             |
   | PKP Płudy            |
   | PKP Żerań            |
   | Parowozowa           |
   | Pelcowizna           |
   | Polnych Kwiatów      |
   | Raciborska           |
   | Rembielińska         |
   | Sadkowska            |
   | Smugowa              |
   | Starego Dębu         |
   | Zyndrama z Maszkowic |
   | os.Marywilska        |
   | Śpiewaków            |
   +----------------------+
   27 rows in set. Query took 0.116 seconds.
   ```
   
   Interestingly, the CSV results don't seem to have the `NULLS LAST` 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] franeklubi commented on issue #1441: Incorrect results in datafusion-cli

Posted by GitBox <gi...@apache.org>.

franeklubi commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992446558


   [no_nulls.zip](https://github.com/apache/arrow-datafusion/files/7703926/no_nulls.zip)
   
   I removed NULLs from Parquet files (missing strings got replaced with "NULL" and ints with 0).
   
   Querying these files produces the desired results, which further confirms my theory, that there is something wrong with NULL handling.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

Jimexist edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992609965


   thinking out loud: doing `pandas` validation:
   
   ```python
   In [10]: import pandas as pd
   
   In [11]: csv_pd = pd.read_csv('./csvs/stop.csv')
   
   In [12]: pq_pd = pd.read_parquet('./parquets/stops')
   
   In [13]: csv_pd.info()
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 33254 entries, 0 to 33253
   Data columns (total 4 columns):
    #   Column     Non-Null Count  Dtype
   ---  ------     --------------  -----
    0   time       33254 non-null  object
    1   trip_tid   32113 non-null  float64
    2   trip_line  32126 non-null  object
    3   stop_name  705 non-null    object
   dtypes: float64(1), object(3)
   memory usage: 1.0+ MB
   
   In [14]: pq_pd.info()
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 33254 entries, 0 to 33253
   Data columns (total 4 columns):
    #   Column     Non-Null Count  Dtype
   ---  ------     --------------  -----
    0   time       33254 non-null  datetime64[ns]
    1   trip_tid   32113 non-null  float64
    2   trip_line  32126 non-null  object
    3   stop_name  705 non-null    object
   dtypes: datetime64[ns](1), float64(1), object(2)
   memory usage: 1.0+ MB
   ``` 
   
   and at least the `stop_name` columns are the same.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

Jimexist commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992620238


   seems like these values are absent in parquet readings:
   
   ```
   ❯ select stop_name from stop_csv except select stop_name from stop_parquet order by stop_name;
   +------------------------+
   | stop_name              |
   +------------------------+
   |                        |
   | CH Promenada           |
   | Centrum                |
   | Ciołkosza              |
   | Kanał Gocławski        |
   | Marszałkowska          |
   | Marysin                |
   | Metro Politechnika     |
   | Metro Świętokrzyska    |
   | Odkryta                |
   | Okularowa              |
   | Poligonowa             |
   | Przyczółek Grochowski  |
   | Rezedowa               |
   | Rozbrat                |
   | Saska                  |
   | Zajezdnia Ostrobramska |
   | Zamieniecka            |
   | pl.Bankowy             |
   | pl.Konstytucji         |
   | pl.Na Rozdrożu         |
   | stop_name              |
   | Łysakowska             |
   +------------------------+
   23 rows in set. Query took 0.019 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

Jimexist edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992620238


   seems like these values are absent in parquet readings:
   
   ```
   ❯ select stop_name from stop_csv except select stop_name from stop_parquet order by stop_name;
   +------------------------+
   | stop_name              |
   +------------------------+
   |                        |
   | CH Promenada           |
   | Centrum                |
   | Ciołkosza              |
   | Kanał Gocławski        |
   | Marszałkowska          |
   | Marysin                |
   | Metro Politechnika     |
   | Metro Świętokrzyska    |
   | Odkryta                |
   | Okularowa              |
   | Poligonowa             |
   | Przyczółek Grochowski  |
   | Rezedowa               |
   | Rozbrat                |
   | Saska                  |
   | Zajezdnia Ostrobramska |
   | Zamieniecka            |
   | pl.Bankowy             |
   | pl.Konstytucji         |
   | pl.Na Rozdrożu         |
   | stop_name              |
   | Łysakowska             |
   +------------------------+
   23 rows in set. Query took 0.019 seconds.
   ```
   
   and the scanning process was invalid:
   
   ```
   ❯ select distinct stop_name from stop_parquet where stop_name = 'Odkryta';
   +-----------+
   | stop_name |
   +-----------+
   | Odkryta   |
   +-----------+
   1 row in set. Query took 0.007 seconds.
   ```
   
   notice the `Odkryta` row would be missing in the statement below.
   
   ```
   ❯ select distinct stop_name from stop_parquet;
   +------------------------------+
   | stop_name                    |
   +------------------------------+
   |                              |
   | Bystra                       |
   | Dobosza                      |
   | Urbanistów                   |
   | Pelcowizna                   |
   | PUSTELNIK                    |
   | Strzeleckiego                |
   | Bełdan                       |
   | Kijowska                     |
   | Świątynia Opatrzności Bożej  |
   | Białołęka-Ratusz             |
   | Armatnia                     |
   | Muranowska                   |
   | Rokosowska                   |
   | Choszczówka                  |
   | os.Marywilska                |
   | Metro Stokłosy               |
   | Leśnej Polanki               |
   | Ćwiklińskiej                 |
   | Nowodwory                    |
   | Stawki                       |
   | PKP Falenica                 |
   | Bazyliańska                  |
   | Bartnicza                    |
   | ARMATNIA                     |
   | Metro Ratusz-Arsenał         |
   | Ćmielowska                   |
   | Myśliborska                  |
   | Oś Królewska                 |
   | RONDO ZESŁAŃCÓW SYBERYJSKICH |
   | pl.Inwalidów                 |
   | Opaczewska                   |
   | Rudzka                       |
   | Nowolipie                    |
   | Polnych Kwiatów              |
   | PKP Żerań                    |
   | Sarmacka                     |
   | rondo Zesłańców Syberyjskich |
   | Leszno                       |
   | Gorzykowska                  |
   | Miła                         |
   | pl.Narutowicza               |
   | pl.Zawiszy                   |
   | Majerankowa                  |
   | Daniszewska                  |
   | PKP Olszynka Grochowska      |
   | Stare Miasto                 |
   | Cm.Wolski                    |
   | Metro Stadion Narodowy       |
   | Małych Dębów                 |
   | Marywilska-Las               |
   | Powsinek                     |
   | Branickiego                  |
   | Vogla                        |
   | Olesin                       |
   | Ceramiczna                   |
   | CH Marki                     |
   | Dw.Wileński                  |
   | Hala Kopińska                |
   | Inflancka                    |
   | Klaudyny                     |
   | Chłodna                      |
   | Zbójna Góra                  |
   | Czołgistów                   |
   | Sadkowska                    |
   | Insurekcji                   |
   | Brzezińska                   |
   | Ołówkowa                     |
   | Smugowa                      |
   | Dąbrówka Wiślana             |
   | Żerań FSO                    |
   | Kino Femina                  |
   | gen.Zajączka                 |
   | Wola-Ratusz                  |
   | Bohomolca                    |
   | Chłodnia                     |
   | Starego Dębu                 |
   | Raciborska                   |
   | Rembielińska                 |
   | Wałbrzyska-Cmentarz          |
   | EC Żerań                     |
   | os.Potok                     |
   | Metro Księcia Janusza        |
   | Dw.Gdański                   |
   | Mennica                      |
   | Norblin                      |
   | Sienna                       |
   | Gwiaździsta                  |
   | Sobocka                      |
   | Marymont-Potok               |
   | Śpiewaków                    |
   | Fabryka Pomp                 |
   | METRO RATUSZ-ARSENAŁ         |
   | Osiedle                      |
   | Szczęśliwice                 |
   | Szwedzka                     |
   | pl.Starynkiewicza            |
   | Wawelska                     |
   | Dzika                        |
   | Budowlana                    |
   | PKP Płudy                    |
   | Bolesławicka                 |
   | Marcelin                     |
   | Milenijna                    |
   | Park Praski                  |
   | Wolności                     |
   | Smocza                       |
   | pl.Wilsona                   |
   | Cygańska                     |
   | Parowozowa                   |
   | Zyndrama z Maszkowic         |
   | Parafialna                   |
   +------------------------------+
   112 rows in set. Query took 0.008 seconds.
   ```
   
   Given the investigation above I believe this might be related to the single distinct to group by optimization or the hash aggregation steps.
   
   cc @Dandandan @houqp 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

tustvold edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1000499250


   You could try using ComplexObjectArrayReader to decode the column instead of ArrowArrayReader. This might help narrow down if the bug lies in the RLE decoding or something higher up in ArrowArrayReader. Alternatively you could see if the issue occurs with #1082 which replaces ArrowArrayReader with an alternative that shares more with the PrimitiveArrayReader/ComplexObjectArrayReader implementations, again this might help narrow down the origin.
   
   Not at a computer at the moment, otherwise would try myself


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1013688365


   I believe with the update to use arrow 7.0.0 which contains @yordan-pavlov 's fix, this should now be fixed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

Dandandan commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-999873633


   A guess: it might be an issue with reading statistics / predicate push down?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

Dandandan commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1000461707


   @alamb this reads like a detective 🕵️‍♂️ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1038174703


   I reran the queries from @franeklubi :
   
   # Setup
   ```sql
   CREATE EXTERNAL TABLE stop (time TEXT, trip_tid TEXT, trip_line TEXT, stop_name TEXT) STORED AS CSV WITH HEADER ROW LOCATION '/Users/alamb/Downloads/issue_data/csvs/stop.csv';
   CREATE EXTERNAL TABLE trip (tid TEXT, line TEXT, base_day TEXT) STORED AS CSV WITH HEADER ROW LOCATION '/Users/alamb/Downloads/issue_data/csvs/trip.csv';
   
   CREATE EXTERNAL TABLE stop_parquet STORED AS PARQUET LOCATION '/Users/alamb/Downloads/issue_data/parquets/stops/';
   CREATE EXTERNAL TABLE trip_parquet STORED AS PARQUET LOCATION '/Users/alamb/Downloads/issue_data/parquets/trips/';
   ```
   
   Now the results are consistent when using csv and parquet:
   
   ## Parquet
   ```sql
   ❯ SELECT DISTINCT stop_name FROM stop_parquet INNER JOIN trip_parquet ON tid = trip_tid WHERE line = '176' ORDER BY stop_name NULLS LAST;
   +----------------------+
   | stop_name            |
   +----------------------+
   | Bartnicza            |
   | Bazyliańska          |
   | Bolesławicka         |
   | Brzezińska           |
   | Budowlana            |
   | Choszczówka          |
   | Chłodnia             |
   | Daniszewska          |
   | Fabryka Pomp         |
   | Insurekcji           |
   | Marcelin             |
   | Marywilska-Las       |
   | Ołówkowa             |
   | PKP Płudy            |
   | PKP Żerań            |
   | Parowozowa           |
   | Pelcowizna           |
   | Polnych Kwiatów      |
   | Raciborska           |
   | Rembielińska         |
   | Sadkowska            |
   | Smugowa              |
   | Starego Dębu         |
   | Zyndrama z Maszkowic |
   | os.Marywilska        |
   | Śpiewaków            |
   |                      |
   +----------------------+
   27 rows in set. Query took 0.042 seconds.
   ```
   
   ## Csv
   
   ```sql
   ❯ SELECT DISTINCT stop_name FROM stop INNER JOIN trip ON tid = trip_tid WHERE line = '176' ORDER BY stop_name NULLS LAST;
   +----------------------+
   | stop_name            |
   +----------------------+
   |                      |
   | Bartnicza            |
   | Bazyliańska          |
   | Bolesławicka         |
   | Brzezińska           |
   | Budowlana            |
   | Choszczówka          |
   | Chłodnia             |
   | Daniszewska          |
   | Fabryka Pomp         |
   | Insurekcji           |
   | Marcelin             |
   | Marywilska-Las       |
   | Ołówkowa             |
   | PKP Płudy            |
   | PKP Żerań            |
   | Parowozowa           |
   | Pelcowizna           |
   | Polnych Kwiatów      |
   | Raciborska           |
   | Rembielińska         |
   | Sadkowska            |
   | Smugowa              |
   | Starego Dębu         |
   | Zyndrama z Maszkowic |
   | os.Marywilska        |
   | Śpiewaków            |
   +----------------------+
   27 rows in set. Query took 0.116 seconds.
   ```
   
   Interestingly, the CSV results don't seem to have the `NULLS LAST` 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1000806425


   Thanks @tustvold  and @jorgecarleitao  for the tips. Will give them a try. 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb closed issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb closed issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-996655155


   When I just do select * the answer is not correct either:
   
   ```sql
   ❯ select * from stops_parquet;
   +---------------------+----------+-----------+------------------------------+
   | time                | trip_tid | trip_line | stop_name                    |
   +---------------------+----------+-----------+------------------------------+
   ...
   | 2021-11-15 00:00:00 | 54788307 | 186       | Armatnia                     |
   ...
   +---------------------+----------+-----------+------------------------------+
   33254 rows in set. Query took 1.553 seconds.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-996957833


   I have a smaller reproducer and am trying to narrow down where the problem is but probably won't be able to work on this until next week sometime at the earliest
   
   In case anyone else is interested, here is the repo:  [repro.zip](https://github.com/apache/arrow-datafusion/files/7737234/repro.zip)
   
   ```shell
   cargo run --bin datafusion-cli -- -f repro.sql | grep 54788307
   ```
   
   It should print out
   ```
   | 2021-11-15 00:00:00 | 54788307 | 186       | RONDO ZESŁAŃCÓW SYBERYJSKICH |
   ```
   
   But actually prints out  
   ```
   | 2021-11-15 00:00:00 | 54788307 | 186       | Armatnia
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

Jimexist edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992609965


   thinking out loud: doing `pandas` validation:
   
   ```python
   In [10]: import pandas as pd
   
   In [11]: csv_pd = pd.read_csv('./csvs/stop.csv')
   
   In [12]: pq_pd = pd.read_parquet('./parquets/stops')
   
   In [13]: csv_pd.info()
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 33254 entries, 0 to 33253
   Data columns (total 4 columns):
    #   Column     Non-Null Count  Dtype
   ---  ------     --------------  -----
    0   time       33254 non-null  object
    1   trip_tid   32113 non-null  float64
    2   trip_line  32126 non-null  object
    3   stop_name  705 non-null    object
   dtypes: float64(1), object(3)
   memory usage: 1.0+ MB
   
   In [14]: pq_pd.info()
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 33254 entries, 0 to 33253
   Data columns (total 4 columns):
    #   Column     Non-Null Count  Dtype
   ---  ------     --------------  -----
    0   time       33254 non-null  datetime64[ns]
    1   trip_tid   32113 non-null  float64
    2   trip_line  32126 non-null  object
    3   stop_name  705 non-null    object
   dtypes: datetime64[ns](1), float64(1), object(2)
   memory usage: 1.0+ MB
   
   ``` 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-996268224


   Oh weird, but then a direct query
   
   ```
   select * from stops_parquet where trip_tid=54788307;
   +---------------------+----------+-----------+------------------------------+
   | time                | trip_tid | trip_line | stop_name                    |
   +---------------------+----------+-----------+------------------------------+
   | 2021-11-15 00:00:00 | 54788307 | 186       | RONDO ZESŁAŃCÓW SYBERYJSKICH |
   +---------------------+----------+-----------+------------------------------+
   ```
   Seems to get the correct answer 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-995190312


   I'll try and check this out in more detail tomorrow. Thanks for the investigation @Jimexist 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1013733967


   Ye that is my understanding -- someone just needs to rerun the (wonderful) reproducer from @franeklubi  to confirm


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1000360146


   The plot thickens! 🕵️  
   
   Regarding parquet predicate pruning, amusingly in this case, I think row group pruning actually helps avoid the problem. As you may recall, when a filter is applied like this
   
   ```sql
   select * from stops_parquet where trip_tid=54788307;
   ```
   
   The answer is correct (`stop_name` is `"RONDO ZESŁAŃCÓW SYBERYJSKICH"`):
   ```sql
   +---------------------+----------+-----------+------------------------------+
   | time                | trip_tid | trip_line | stop_name                    |
   +---------------------+----------+-----------+------------------------------+
   | 2021-11-15 00:00:00 | 54788307 | 186       | RONDO ZESŁAŃCÓW SYBERYJSKICH |
   +---------------------+----------+-----------+------------------------------+
   ```
   
   However, when I disable pruning then the wrong answer comes out!
   ```
   +---------------------+----------+-----------+-----------+
   | time                | trip_tid | trip_line | stop_name |
   +---------------------+----------+-----------+-----------+
   | 2021-11-15 00:00:00 | 54788307 | 186       | Armatnia  |
   +---------------------+----------+-----------+-----------+
   ```
   
   I added some debugging, and verified that the query does in fact skip several row groups:
   
   ```
   Row Group[0], col  ReportStopRecord: pruned = true
   Row Group[1], col  ReportStopRecord: pruned = false
   Row Group[2], col  ReportStopRecord: pruned = false
   Row Group[3], col  ReportStopRecord: pruned = false
   Row Group[4], col  ReportStopRecord: pruned = false
   Row Group[5], col  ReportStopRecord: pruned = false
   Row Group[6], col  ReportStopRecord: pruned = false
   Row Group[7], col  ReportStopRecord: pruned = false
   Row Group[8], col  ReportStopRecord: pruned = false
   Row Group[9], col  ReportStopRecord: pruned = true
   Row Group[10], col  ReportStopRecord: pruned = false
   Row Group[11], col  ReportStopRecord: pruned = false
   Row Group[12], col  ReportStopRecord: pruned = false
   Row Group[13], col  ReportStopRecord: pruned = false
   Row Group[14], col  ReportStopRecord: pruned = false
   Row Group[15], col  ReportStopRecord: pruned = false
   Row Group[16], col  ReportStopRecord: pruned = false
   Row Group[17], col  ReportStopRecord: pruned = false
   Row Group[18], col  ReportStopRecord: pruned = false
   Row Group[19], col  ReportStopRecord: pruned = false
   Row Group[20], col  ReportStopRecord: pruned = false
   Row Group[21], col  ReportStopRecord: pruned = false
   Row Group[22], col  ReportStopRecord: pruned = false
   Row Group[23], col  ReportStopRecord: pruned = false
   Row Group[24], col  ReportStopRecord: pruned = false
   Row Group[25], col  ReportStopRecord: pruned = false
   Row Group[26], col  ReportStopRecord: pruned = false
   Row Group[27], col  ReportStopRecord: pruned = false
   Row Group[28], col  ReportStopRecord: pruned = false
   Row Group[29], col  ReportStopRecord: pruned = false
   Row Group[30], col  ReportStopRecord: pruned = false
   Row Group[31], col  ReportStopRecord: pruned = false
   Row Group[32], col  ReportStopRecord: pruned = true
   Row Group[33], col  ReportStopRecord: pruned = false
   Row Group[34], col  ReportStopRecord: pruned = true
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1000499250


   You could try using ComplexObjectArrayReader to decode the column instead of ArrowArrayReader. This might help narrow down if the bug lies in the RLE decoding or something higher up in ArrowArrayReader. Alternatively you could see if the issue occurs with #1082 which replaces ArrowArrayReader with an alternative that shares more with the primitive array implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1000360146


   The plot thickens! 🕵️  
   
   Regarding parquet predicate pruning, amusingly in this case, I think row group pruning actually helps avoid the problem. As you may recall, when a filter is applied like this
   
   ```sql
   select * from stops_parquet where trip_tid=54788307;
   ```
   
   The answer is correct (`stop_name` is `"RONDO ZESŁAŃCÓW SYBERYJSKICH"`):
   ```sql
   +---------------------+----------+-----------+------------------------------+
   | time                | trip_tid | trip_line | stop_name                    |
   +---------------------+----------+-----------+------------------------------+
   | 2021-11-15 00:00:00 | 54788307 | 186       | RONDO ZESŁAŃCÓW SYBERYJSKICH |
   +---------------------+----------+-----------+------------------------------+
   ```
   
   However, when I disable pruning then the wrong answer comes out!
   ```sql
   +---------------------+----------+-----------+-----------+
   | time                | trip_tid | trip_line | stop_name |
   +---------------------+----------+-----------+-----------+
   | 2021-11-15 00:00:00 | 54788307 | 186       | Armatnia  |
   +---------------------+----------+-----------+-----------+
   ```
   
   I added some debugging, and verified that the query does in fact skip several row groups:
   
   ```
   Row Group[0], col  ReportStopRecord: pruned = true
   Row Group[1], col  ReportStopRecord: pruned = false
   Row Group[2], col  ReportStopRecord: pruned = false
   Row Group[3], col  ReportStopRecord: pruned = false
   Row Group[4], col  ReportStopRecord: pruned = false
   Row Group[5], col  ReportStopRecord: pruned = false
   Row Group[6], col  ReportStopRecord: pruned = false
   Row Group[7], col  ReportStopRecord: pruned = false
   Row Group[8], col  ReportStopRecord: pruned = false
   Row Group[9], col  ReportStopRecord: pruned = true
   Row Group[10], col  ReportStopRecord: pruned = false
   Row Group[11], col  ReportStopRecord: pruned = false
   Row Group[12], col  ReportStopRecord: pruned = false
   Row Group[13], col  ReportStopRecord: pruned = false
   Row Group[14], col  ReportStopRecord: pruned = false
   Row Group[15], col  ReportStopRecord: pruned = false
   Row Group[16], col  ReportStopRecord: pruned = false
   Row Group[17], col  ReportStopRecord: pruned = false
   Row Group[18], col  ReportStopRecord: pruned = false
   Row Group[19], col  ReportStopRecord: pruned = false
   Row Group[20], col  ReportStopRecord: pruned = false
   Row Group[21], col  ReportStopRecord: pruned = false
   Row Group[22], col  ReportStopRecord: pruned = false
   Row Group[23], col  ReportStopRecord: pruned = false
   Row Group[24], col  ReportStopRecord: pruned = false
   Row Group[25], col  ReportStopRecord: pruned = false
   Row Group[26], col  ReportStopRecord: pruned = false
   Row Group[27], col  ReportStopRecord: pruned = false
   Row Group[28], col  ReportStopRecord: pruned = false
   Row Group[29], col  ReportStopRecord: pruned = false
   Row Group[30], col  ReportStopRecord: pruned = false
   Row Group[31], col  ReportStopRecord: pruned = false
   Row Group[32], col  ReportStopRecord: pruned = true
   Row Group[33], col  ReportStopRecord: pruned = false
   Row Group[34], col  ReportStopRecord: pruned = true
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

Jimexist edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992607329


   thanks for the update.
   
   to people reading this, i'm still trying to minimize the reproduction steps, so i guess below is a simpler statement:
   
   ```
   CREATE EXTERNAL TABLE stop_parquet STORED AS PARQUET LOCATION './parquets/stops';
   CREATE EXTERNAL TABLE stop_csv (time TEXT, trip_tid TEXT, trip_line TEXT, stop_name TEXT) STORED AS CSV LOCATION './csvs/stop.csv';
   ```
   
   ```
   ❯ select distinct stop_name from stop_csv;
   +------------------------------+
   | stop_name                    |
   +------------------------------+
   | Szczęśliwice                 |
   | Wawelska                     |
   | Bolesławicka                 |
   ...
   | Ceramiczna                   |
   | Czołgistów                   |
   +------------------------------+
   134 rows in set. Query took 0.015 seconds.
   ```
   
   ```
   ❯ select distinct stop_name from stop_parquet;
   +------------------------------+
   | stop_name                    |
   +------------------------------+
   | Milenijna                    |
   | Park Praski                  |
   | Wolności                     |
   ...
   | Dzika                        |
   | Budowlana                    |
   | PKP Płudy                    |
   | Bolesławicka                 |
   | Marcelin                     |
   +------------------------------+
   112 rows in set. Query took 0.008 seconds.
   ```
   
   i.e. even without join, we can tell that parquet and csv reads differently.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] franeklubi commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

franeklubi commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992590365


   Hi @Jimexist! Thanks for the reply
   
   The issue results from the headers kept in the CSV files. That's my bad - please remove them before testing.
   
   I will amend the issue info, so others won't get the same problem as you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1000479749


   I can also open the parquet files from arrow2. I think that this is something on the parquet crate.
   
   the below pasted in [this example](https://github.com/jorgecarleitao/arrow2/blob/main/examples/parquet_read_record.rs):
   ```rust
   let mut distinct = HashSet::<String>::new();
       let start = SystemTime::now();
       for maybe_batch in reader {
           let batch = maybe_batch?;
           let a = batch
               .column(3)
               .as_any()
               .downcast_ref::<Utf8Array<i32>>()
               .unwrap();
           for i in a {
               if let Some(i) = i {
                   distinct.insert(i.to_string());
               }
           }
       }
       println!("{}", distinct.len());
       println!("{:#?}", distinct);
   ```
   using 
   
   ```
   cargo run --features io_parquet --example parquet_read_record -- parquets/stops/2021-11.parquet
   ```
   
   yields 132 valid stop_names (over all row groups):
   ```
   {
       "pl.Na Rozdrożu",
       "pl.Zawiszy",
       "Szczęśliwice",
       "Stawki",
       "Vogla",
       "PKP Płudy",
       "Ceramiczna",
       "Chłodna",
       "rondo Zesłańców Syberyjskich",
       "PUSTELNIK",
       "Strzeleckiego",
       "Osiedle",
       "Kanał Gocławski",
       "Centrum",
       "Żerań FSO",
       "Bystra",
       "Powsinek",
       "Sadkowska",
       "Dzika",
       "Hala Kopińska",
       "Nowolipie",
       "gen.Zajączka",
       "Leśnej Polanki",
       "Rembielińska",
       "Pelcowizna",
       "Armatnia",
       "Bełdan",
       "Ćmielowska",
       "Miła",
       "Zamieniecka",
       "Opaczewska",
       "Metro Stadion Narodowy",
       "Bohomolca",
       "Odkryta",
       "pl.Inwalidów",
       "Parafialna",
       "Marysin",
       "Marcelin",
       "Marymont-Potok",
       "Oś Królewska",
       "RONDO ZESŁAŃCÓW SYBERYJSKICH",
       "Białołęka-Ratusz",
       "Mennica",
       "Rudzka",
       "Daniszewska",
       "Budowlana",
       "Sobocka",
       "Starego Dębu",
       "Metro Ratusz-Arsenał",
       "PKP Olszynka Grochowska",
       "Fabryka Pomp",
       "CH Marki",
       "Łysakowska",
       "Brzezińska",
       "Cm.Wolski",
       "Olesin",
       "Dw.Gdański",
       "pl.Starynkiewicza",
       "pl.Wilsona",
       "Ciołkosza",
       "CH Promenada",
       "os.Potok",
       "Norblin",
       "Zbójna Góra",
       "Wola-Ratusz",
       "Czołgistów",
       "Rozbrat",
       "pl.Narutowicza",
       "Rokosowska",
       "Metro Politechnika",
       "Nowodwory",
       "Rezedowa",
       "Park Praski",
       "Dw.Wileński",
       "Bartnicza",
       "Kijowska",
       "Cygańska",
       "Ołówkowa",
       "Marszałkowska",
       "ARMATNIA",
       "PKP Żerań",
       "PKP Falenica",
       "METRO RATUSZ-ARSENAŁ",
       "Polnych Kwiatów",
       "Myśliborska",
       "Smocza",
       "pl.Konstytucji",
       "Urbanistów",
       "Okularowa",
       "Smugowa",
       "Marywilska-Las",
       "Gorzykowska",
       "Zyndrama z Maszkowic",
       "Szwedzka",
       "Dobosza",
       "Muranowska",
       "Majerankowa",
       "Stare Miasto",
       "Dąbrówka Wiślana",
       "Wawelska",
       "Insurekcji",
       "Kino Femina",
       "pl.Bankowy",
       "Poligonowa",
       "Gwiaździsta",
       "Branickiego",
       "Przyczółek Grochowski",
       "Wałbrzyska-Cmentarz",
       "Saska",
       "Raciborska",
       "Śpiewaków",
       "Bolesławicka",
       "Sienna",
       "Choszczówka",
       "Metro Księcia Janusza",
       "Metro Świętokrzyska",
       "os.Marywilska",
       "Chłodnia",
       "Wolności",
       "Bazyliańska",
       "Klaudyny",
       "Leszno",
       "Sarmacka",
       "Metro Stokłosy",
       "Ćwiklińskiej",
       "Inflancka",
       "Parowozowa",
       "Małych Dębów",
       "Zajezdnia Ostrobramska",
       "EC Żerań",
       "Milenijna",
       "Świątynia Opatrzności Bożej",
   }
   ```
   
   For reference, the file heavily uses RLE-encoding (i.e. the RLE bit of the RLE-bitpacking hybrid parquet encoder), both for the validity and for the dictionary indices, so that would be a place to go for.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1000461320


   I will probably file a ticket in arrow-rs shortly with the slimmed down reproducer


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1004332824


   I tested the fix from @yordan-pavlov  in https://github.com/apache/arrow-rs/pull/1130  against my reproducer and with #1130  it now gets the correct answer ❤️ so that seems like progress


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

Jimexist edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992609965


   thinking out loud: doing `pandas` validation:
   
   ```jupyter
   In [10]: import pandas as pd
   
   In [11]: csv_pd = pd.read_csv('./csvs/stop.csv')
   
   In [12]: pq_pd = pd.read_parquet('./parquets/stops')
   
   In [13]: csv_pd.info()
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 33254 entries, 0 to 33253
   Data columns (total 4 columns):
    #   Column     Non-Null Count  Dtype
   ---  ------     --------------  -----
    0   time       33254 non-null  object
    1   trip_tid   32113 non-null  float64
    2   trip_line  32126 non-null  object
    3   stop_name  705 non-null    object
   dtypes: float64(1), object(3)
   memory usage: 1.0+ MB
   
   In [14]: pq_pd.info()
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 33254 entries, 0 to 33253
   Data columns (total 4 columns):
    #   Column     Non-Null Count  Dtype
   ---  ------     --------------  -----
    0   time       33254 non-null  datetime64[ns]
    1   trip_tid   32113 non-null  float64
    2   trip_line  32126 non-null  object
    3   stop_name  705 non-null    object
   dtypes: datetime64[ns](1), float64(1), object(2)
   memory usage: 1.0+ MB
   
   ``` 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] franeklubi edited a comment on issue #1441: Incorrect results in datafusion-cli

Posted by GitBox <gi...@apache.org>.

franeklubi edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992446558


   [no_nulls.zip](https://github.com/apache/arrow-datafusion/files/7703926/no_nulls.zip)
   
   In the file above I removed NULLs from Parquet files (missing strings got replaced with "NULL" and ints with 0).
   
   Querying these files produces the desired results, which further confirms my theory, that there is something wrong with NULL handling.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1002798788


   FYI https://github.com/apache/arrow-rs/pull/1110 runs into a similar issue, that appears to be fixed by switching to ComplexObjectArrayReader instead of ArrowArrayReader. I will have a poke around tomorrow if I have time, and see if I can spot what is going wrong.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1038174703


   I reran the queries from @franeklubi at 2e918184c502a40a041fb0163702cb6ab8de0af9:
   
   # Setup
   ```sql
   CREATE EXTERNAL TABLE stop (time TEXT, trip_tid TEXT, trip_line TEXT, stop_name TEXT) STORED AS CSV WITH HEADER ROW LOCATION '/Users/alamb/Downloads/issue_data/csvs/stop.csv';
   CREATE EXTERNAL TABLE trip (tid TEXT, line TEXT, base_day TEXT) STORED AS CSV WITH HEADER ROW LOCATION '/Users/alamb/Downloads/issue_data/csvs/trip.csv';
   
   CREATE EXTERNAL TABLE stop_parquet STORED AS PARQUET LOCATION '/Users/alamb/Downloads/issue_data/parquets/stops/';
   CREATE EXTERNAL TABLE trip_parquet STORED AS PARQUET LOCATION '/Users/alamb/Downloads/issue_data/parquets/trips/';
   ```
   
   Now the results are consistent when using csv and parquet:
   
   ## Parquet
   ```sql
   ❯ SELECT DISTINCT stop_name FROM stop_parquet INNER JOIN trip_parquet ON tid = trip_tid WHERE line = '176' ORDER BY stop_name NULLS LAST;
   +----------------------+
   | stop_name            |
   +----------------------+
   | Bartnicza            |
   | Bazyliańska          |
   | Bolesławicka         |
   | Brzezińska           |
   | Budowlana            |
   | Choszczówka          |
   | Chłodnia             |
   | Daniszewska          |
   | Fabryka Pomp         |
   | Insurekcji           |
   | Marcelin             |
   | Marywilska-Las       |
   | Ołówkowa             |
   | PKP Płudy            |
   | PKP Żerań            |
   | Parowozowa           |
   | Pelcowizna           |
   | Polnych Kwiatów      |
   | Raciborska           |
   | Rembielińska         |
   | Sadkowska            |
   | Smugowa              |
   | Starego Dębu         |
   | Zyndrama z Maszkowic |
   | os.Marywilska        |
   | Śpiewaków            |
   |                      |
   +----------------------+
   27 rows in set. Query took 0.042 seconds.
   ```
   
   ## Csv
   
   ```sql
   ❯ SELECT DISTINCT stop_name FROM stop INNER JOIN trip ON tid = trip_tid WHERE line = '176' ORDER BY stop_name NULLS LAST;
   +----------------------+
   | stop_name            |
   +----------------------+
   |                      |
   | Bartnicza            |
   | Bazyliańska          |
   | Bolesławicka         |
   | Brzezińska           |
   | Budowlana            |
   | Choszczówka          |
   | Chłodnia             |
   | Daniszewska          |
   | Fabryka Pomp         |
   | Insurekcji           |
   | Marcelin             |
   | Marywilska-Las       |
   | Ołówkowa             |
   | PKP Płudy            |
   | PKP Żerań            |
   | Parowozowa           |
   | Pelcowizna           |
   | Polnych Kwiatów      |
   | Raciborska           |
   | Rembielińska         |
   | Sadkowska            |
   | Smugowa              |
   | Starego Dębu         |
   | Zyndrama z Maszkowic |
   | os.Marywilska        |
   | Śpiewaków            |
   +----------------------+
   27 rows in set. Query took 0.116 seconds.
   ```
   
   Interestingly, the CSV results don't seem to have the `NULLS LAST` 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1038174703


   I reran the queries from @franeklubi :
   
   # Setup
   ```sql
   CREATE EXTERNAL TABLE stop (time TEXT, trip_tid TEXT, trip_line TEXT, stop_name TEXT) STORED AS CSV WITH HEADER ROW LOCATION '/Users/alamb/Downloads/issue_data/csvs/stop.csv';
   CREATE EXTERNAL TABLE trip (tid TEXT, line TEXT, base_day TEXT) STORED AS CSV WITH HEADER ROW LOCATION '/Users/alamb/Downloads/issue_data/csvs/trip.csv';
   
   CREATE EXTERNAL TABLE stop_parquet STORED AS PARQUET LOCATION '/Users/alamb/Downloads/issue_data/parquets/stops/';
   CREATE EXTERNAL TABLE trip_parquet STORED AS PARQUET LOCATION '/Users/alamb/Downloads/issue_data/parquets/trips/';
   ```
   
   Now the results are consistent when using csv and parquet:
   
   ## Parquet
   ```sql
   ❯ SELECT DISTINCT stop_name FROM stop_parquet INNER JOIN trip_parquet ON tid = trip_tid WHERE line = '176' ORDER BY stop_name NULLS LAST;
   +----------------------+
   | stop_name            |
   +----------------------+
   | Bartnicza            |
   | Bazyliańska          |
   | Bolesławicka         |
   | Brzezińska           |
   | Budowlana            |
   | Choszczówka          |
   | Chłodnia             |
   | Daniszewska          |
   | Fabryka Pomp         |
   | Insurekcji           |
   | Marcelin             |
   | Marywilska-Las       |
   | Ołówkowa             |
   | PKP Płudy            |
   | PKP Żerań            |
   | Parowozowa           |
   | Pelcowizna           |
   | Polnych Kwiatów      |
   | Raciborska           |
   | Rembielińska         |
   | Sadkowska            |
   | Smugowa              |
   | Starego Dębu         |
   | Zyndrama z Maszkowic |
   | os.Marywilska        |
   | Śpiewaków            |
   |                      |
   +----------------------+
   27 rows in set. Query took 0.042 seconds.
   ```
   
   ## Csv
   
   ```sql
   ❯ SELECT DISTINCT stop_name FROM stop INNER JOIN trip ON tid = trip_tid WHERE line = '176' ORDER BY stop_name NULLS LAST;
   +----------------------+
   | stop_name            |
   +----------------------+
   |                      |
   | Bartnicza            |
   | Bazyliańska          |
   | Bolesławicka         |
   | Brzezińska           |
   | Budowlana            |
   | Choszczówka          |
   | Chłodnia             |
   | Daniszewska          |
   | Fabryka Pomp         |
   | Insurekcji           |
   | Marcelin             |
   | Marywilska-Las       |
   | Ołówkowa             |
   | PKP Płudy            |
   | PKP Żerań            |
   | Parowozowa           |
   | Pelcowizna           |
   | Polnych Kwiatów      |
   | Raciborska           |
   | Rembielińska         |
   | Sadkowska            |
   | Smugowa              |
   | Starego Dębu         |
   | Zyndrama z Maszkowic |
   | os.Marywilska        |
   | Śpiewaków            |
   +----------------------+
   27 rows in set. Query took 0.116 seconds.
   ```
   
   Interestingly, the CSV results don't seem to have the `NULLS LAST` 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

Jimexist commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992609965


   thinking out loud: doing `pandas` validation:
   
   ```ipython
   In [10]: import pandas as pd
   
   In [11]: csv_pd = pd.read_csv('./csvs/stop.csv')
   
   In [12]: pq_pd = pd.read_parquet('./parquets/stops')
   
   In [13]: csv_pd.info()
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 33254 entries, 0 to 33253
   Data columns (total 4 columns):
    #   Column     Non-Null Count  Dtype
   ---  ------     --------------  -----
    0   time       33254 non-null  object
    1   trip_tid   32113 non-null  float64
    2   trip_line  32126 non-null  object
    3   stop_name  705 non-null    object
   dtypes: float64(1), object(3)
   memory usage: 1.0+ MB
   
   In [14]: pq_pd.info()
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 33254 entries, 0 to 33253
   Data columns (total 4 columns):
    #   Column     Non-Null Count  Dtype
   ---  ------     --------------  -----
    0   time       33254 non-null  datetime64[ns]
    1   trip_tid   32113 non-null  float64
    2   trip_line  32126 non-null  object
    3   stop_name  705 non-null    object
   dtypes: datetime64[ns](1), float64(1), object(2)
   memory usage: 1.0+ MB
   
   ``` 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

Jimexist commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992607329


   thanks for the update.
   
   to people reading this, i'm still trying to minimize the reproduction steps, so i guess below is a simpler statement:
   
   ```
   CREATE EXTERNAL TABLE stop_parquet STORED AS PARQUET LOCATION './parquets/stops';
   CREATE EXTERNAL TABLE stop_csv (time TEXT, trip_tid TEXT, trip_line TEXT, stop_name TEXT) STORED AS CSV LOCATION './csvs/stop.csv';
   ```
   
   ```
   ❯ select distinct stop_name from stop_csv;
   +------------------------------+
   | stop_name                    |
   +------------------------------+
   | Szczęśliwice                 |
   | Wawelska                     |
   | Bolesławicka                 |
   ...
   | Ceramiczna                   |
   | Czołgistów                   |
   +------------------------------+
   134 rows in set. Query took 0.015 seconds.
   ```
   
   ```
   ❯ select distinct stop_name from stop_parquet;
   +------------------------------+
   | stop_name                    |
   +------------------------------+
   | Milenijna                    |
   | Park Praski                  |
   | Wolności                     |
   ...
   | Dzika                        |
   | Budowlana                    |
   | PKP Płudy                    |
   | Bolesławicka                 |
   | Marcelin                     |
   +------------------------------+
   112 rows in set. Query took 0.008 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

Jimexist commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-992576814


   hi @franeklubi thanks for the detailed sharing.
   
   to simplify bug reproduction, can you help me understand the difference between parquet and csv data? specifically:
   
   ```
   ❯ CREATE EXTERNAL TABLE stop STORED AS PARQUET LOCATION './parquets/stops';
   0 rows in set. Query took 0.001 seconds.
   ❯ select count(*) from stop;
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 33254           |
   +-----------------+
   1 row in set. Query took 0.007 seconds.
   ```
   
   ```
   ❯ CREATE EXTERNAL TABLE stop (time TEXT, trip_tid TEXT, trip_line TEXT, stop_name TEXT) STORED AS CSV LOCATION './csvs/stop.csv';
   0 rows in set. Query took 0.000 seconds.
   ❯ select count(*) from stop;
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 33255           |
   +-----------------+
   1 row in set. Query took 0.014 seconds.
   ```
   
   they seem to have different number of rows.
   
   Same thing applies to trips data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

tustvold edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1013688365


   I believe with the update to use arrow 7.0.0 which contains @yordan-pavlov 's fix, this should now be fixed in DataFusion?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1000429890


   Cross referencing the data output from pandas and the parquer reader, 
   Here are the rows from pandas and arrow/parquet that have non null values for `stop_name`:
   
   Pandas:
   ```
   1523,2021-11-15 02:40:32,54827807,102,PUSTELNIK
   2475,2021-11-15 04:54:14,54807500,102,PKP Olszynka Grochowska
   6218,2021-11-15 10:25:27,54802989,104,Żerań FSO
   7286,2021-11-15 04:29:31,54787914,140,METRO RATUSZ-ARSENAŁ
   7431,2021-11-15 08:23:38,54793831,157,Rokosowska
   7433,2021-11-15 10:08:11,54793833,157,Sienna
   7438,2021-11-15 11:56:45,54793835,157,Wola-Ratusz
   7447,2021-11-15 21:15:54,54793844,157,Miła
   7479,2021-11-15 12:06:00,54793886,157,Hala Kopińska
   7692,2021-11-15 10:56:08,54793834,157,Mennica
   7693,2021-11-15 11:52:22,54793835,157,Smocza
   7694,2021-11-15 11:58:07,54793835,157,Wola-Ratusz
   7696,2021-11-15 14:42:33,54793838,157,Wawelska
   7702,2021-11-15 20:24:26,54793843,157,Muranowska
   7819,2021-11-15 04:59:07,54793828,157,pl.Starynkiewicza
   7824,2021-11-15 08:08:57,54793831,157,Chłodna
   7827,2021-11-15 10:11:37,54793833,157,pl.Zawiszy
   7828,2021-11-15 10:49:38,54793834,157,pl.Zawiszy
   7829,2021-11-15 10:53:04,54793834,157,Sienna
   7830,2021-11-15 12:16:45,54793835,157,Dobosza
   
   ```
   
   And arrow says the following (interestingly, note that the `PUSTELNIK` is repeated and then the sequence of values is very similar but offset)
   ```
   1523,2021-11-15 02:40:32,54827807,102,PUSTELNIK
   2475,2021-11-15 04:54:14,54807500,102,PKP Olszynka Grochowska
   6218,2021-11-15 10:25:27,54802989,104,PUSTELNIK
   7286,2021-11-15 04:29:31,54787914,140,PUSTELNIK
   7431,2021-11-15 08:23:38,54793831,157,PUSTELNIK
   7433,2021-11-15 10:08:11,54793833,157,PUSTELNIK
   7438,2021-11-15 11:56:45,54793835,157,PUSTELNIK
   7447,2021-11-15 21:15:54,54793844,157,PUSTELNIK
   7479,2021-11-15 12:06:00,54793886,157,Żerań FSO
   7692,2021-11-15 10:56:08,54793834,157,Rokosowska
   7693,2021-11-15 11:52:22,54793835,157,Sienna
   7694,2021-11-15 11:58:07,54793835,157,Wola-Ratusz
   7696,2021-11-15 14:42:33,54793838,157,Miła
   7702,2021-11-15 20:24:26,54793843,157,Hala Kopińska
   7819,2021-11-15 04:59:07,54793828,157,Chłodna
   7824,2021-11-15 08:08:57,54793831,157,Mennica
   7826,2021-11-15 08:36:09,54793832,157,Smocza
   7827,2021-11-15 10:11:37,54793833,157,Wola-Ratusz
   7828,2021-11-15 10:49:38,54793834,157,Wawelska
   7829,2021-11-15 10:53:04,54793834,157,Muranowska
   7830,2021-11-15 12:16:45,54793835,157,pl.Starynkiewicza
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold edited a comment on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

tustvold edited a comment on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1000499250


   You could try using ComplexObjectArrayReader to decode the column instead of ArrowArrayReader. This might help narrow down if the bug lies in the RLE decoding or something higher up in ArrowArrayReader. Alternatively you could see if the issue occurs with #1082 which replaces ArrowArrayReader with an alternative that shares more with the primitive array implementation, again this might help narrow down the origin.
   
   Not at a computer at the moment, otherwise would try myself


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #1441: Incorrect results in datafusion

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1000429890


   Cross referencing the data output from pandas and the parquer reader, 
   Here are the rows from pandas and arrow/parquet that have non null values for `stop_name`:
   
   Pandas:
   ```
   1523,2021-11-15 02:40:32,54827807,102,PUSTELNIK
   2475,2021-11-15 04:54:14,54807500,102,PKP Olszynka Grochowska
   6218,2021-11-15 10:25:27,54802989,104,Żerań FSO
   7286,2021-11-15 04:29:31,54787914,140,METRO RATUSZ-ARSENAŁ
   7431,2021-11-15 08:23:38,54793831,157,Rokosowska
   7433,2021-11-15 10:08:11,54793833,157,Sienna
   7438,2021-11-15 11:56:45,54793835,157,Wola-Ratusz
   7447,2021-11-15 21:15:54,54793844,157,Miła
   7479,2021-11-15 12:06:00,54793886,157,Hala Kopińska
   7692,2021-11-15 10:56:08,54793834,157,Mennica
   7693,2021-11-15 11:52:22,54793835,157,Smocza
   7694,2021-11-15 11:58:07,54793835,157,Wola-Ratusz
   7696,2021-11-15 14:42:33,54793838,157,Wawelska
   7702,2021-11-15 20:24:26,54793843,157,Muranowska
   7819,2021-11-15 04:59:07,54793828,157,pl.Starynkiewicza
   7824,2021-11-15 08:08:57,54793831,157,Chłodna
   7827,2021-11-15 10:11:37,54793833,157,pl.Zawiszy
   7828,2021-11-15 10:49:38,54793834,157,pl.Zawiszy
   7829,2021-11-15 10:53:04,54793834,157,Sienna
   7830,2021-11-15 12:16:45,54793835,157,Dobosza
   
   ```
   
   And arrow says the following (interestingly, note that the `PUSTELNIK` is repeated and then the sequence of values is very similar but offset)
   ```
   6218,2021-11-15 10:25:27,54802989,104,PUSTELNIK
   7286,2021-11-15 04:29:31,54787914,140,PUSTELNIK
   7431,2021-11-15 08:23:38,54793831,157,PUSTELNIK
   7433,2021-11-15 10:08:11,54793833,157,PUSTELNIK
   7438,2021-11-15 11:56:45,54793835,157,PUSTELNIK
   7447,2021-11-15 21:15:54,54793844,157,PUSTELNIK
   7479,2021-11-15 12:06:00,54793886,157,Żerań FSO
   7692,2021-11-15 10:56:08,54793834,157,Rokosowska
   7693,2021-11-15 11:52:22,54793835,157,Sienna
   7694,2021-11-15 11:58:07,54793835,157,Wola-Ratusz
   7696,2021-11-15 14:42:33,54793838,157,Miła
   7702,2021-11-15 20:24:26,54793843,157,Hala Kopińska
   7819,2021-11-15 04:59:07,54793828,157,Chłodna
   7824,2021-11-15 08:08:57,54793831,157,Mennica
   7826,2021-11-15 08:36:09,54793832,157,Smocza
   7827,2021-11-15 10:11:37,54793833,157,Wola-Ratusz
   7828,2021-11-15 10:49:38,54793834,157,Wawelska
   7829,2021-11-15 10:53:04,54793834,157,Muranowska
   7830,2021-11-15 12:16:45,54793835,157,pl.Starynkiewicza
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org