Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/23 19:04:40 UTC

[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #1441: Incorrect results in datafusion

jorgecarleitao commented on issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-1000479749


   I can also read these parquet files correctly with arrow2, so I think the problem is in the parquet crate.
   
   The snippet below, pasted into [this example](https://github.com/jorgecarleitao/arrow2/blob/main/examples/parquet_read_record.rs):
   ```rust
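   // Needs `use std::collections::HashSet;` and `use arrow2::array::Utf8Array;`
   // in addition to the example's existing imports.
   let mut distinct = HashSet::<String>::new();
   let start = SystemTime::now();
   for maybe_batch in reader {
       let batch = maybe_batch?;
       // Column 3 holds the stop names as a UTF-8 array with i32 offsets.
       let a = batch
           .column(3)
           .as_any()
           .downcast_ref::<Utf8Array<i32>>()
           .unwrap();
       // Iterating yields `Option<&str>`; null slots are skipped.
       for name in a {
           if let Some(name) = name {
               distinct.insert(name.to_string());
           }
       }
   }
   println!("{}", distinct.len());
   println!("{:#?}", distinct);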
   ```
   run with
   
   ```
   cargo run --features io_parquet --example parquet_read_record -- parquets/stops/2021-11.parquet
   ```
   
   yields 132 valid stop_names (over all row groups):
   ```
   {
       "pl.Na Rozdrożu",
       "pl.Zawiszy",
       "Szczęśliwice",
       "Stawki",
       "Vogla",
       "PKP Płudy",
       "Ceramiczna",
       "Chłodna",
       "rondo Zesłańców Syberyjskich",
       "PUSTELNIK",
       "Strzeleckiego",
       "Osiedle",
       "Kanał Gocławski",
       "Centrum",
       "Żerań FSO",
       "Bystra",
       "Powsinek",
       "Sadkowska",
       "Dzika",
       "Hala Kopińska",
       "Nowolipie",
       "gen.Zajączka",
       "Leśnej Polanki",
       "Rembielińska",
       "Pelcowizna",
       "Armatnia",
       "Bełdan",
       "Ćmielowska",
       "Miła",
       "Zamieniecka",
       "Opaczewska",
       "Metro Stadion Narodowy",
       "Bohomolca",
       "Odkryta",
       "pl.Inwalidów",
       "Parafialna",
       "Marysin",
       "Marcelin",
       "Marymont-Potok",
       "Oś Królewska",
       "RONDO ZESŁAŃCÓW SYBERYJSKICH",
       "Białołęka-Ratusz",
       "Mennica",
       "Rudzka",
       "Daniszewska",
       "Budowlana",
       "Sobocka",
       "Starego Dębu",
       "Metro Ratusz-Arsenał",
       "PKP Olszynka Grochowska",
       "Fabryka Pomp",
       "CH Marki",
       "Łysakowska",
       "Brzezińska",
       "Cm.Wolski",
       "Olesin",
       "Dw.Gdański",
       "pl.Starynkiewicza",
       "pl.Wilsona",
       "Ciołkosza",
       "CH Promenada",
       "os.Potok",
       "Norblin",
       "Zbójna Góra",
       "Wola-Ratusz",
       "Czołgistów",
       "Rozbrat",
       "pl.Narutowicza",
       "Rokosowska",
       "Metro Politechnika",
       "Nowodwory",
       "Rezedowa",
       "Park Praski",
       "Dw.Wileński",
       "Bartnicza",
       "Kijowska",
       "Cygańska",
       "Ołówkowa",
       "Marszałkowska",
       "ARMATNIA",
       "PKP Żerań",
       "PKP Falenica",
       "METRO RATUSZ-ARSENAŁ",
       "Polnych Kwiatów",
       "Myśliborska",
       "Smocza",
       "pl.Konstytucji",
       "Urbanistów",
       "Okularowa",
       "Smugowa",
       "Marywilska-Las",
       "Gorzykowska",
       "Zyndrama z Maszkowic",
       "Szwedzka",
       "Dobosza",
       "Muranowska",
       "Majerankowa",
       "Stare Miasto",
       "Dąbrówka Wiślana",
       "Wawelska",
       "Insurekcji",
       "Kino Femina",
       "pl.Bankowy",
       "Poligonowa",
       "Gwiaździsta",
       "Branickiego",
       "Przyczółek Grochowski",
       "Wałbrzyska-Cmentarz",
       "Saska",
       "Raciborska",
       "Śpiewaków",
       "Bolesławicka",
       "Sienna",
       "Choszczówka",
       "Metro Księcia Janusza",
       "Metro Świętokrzyska",
       "os.Marywilska",
       "Chłodnia",
       "Wolności",
       "Bazyliańska",
       "Klaudyny",
       "Leszno",
       "Sarmacka",
       "Metro Stokłosy",
       "Ćwiklińskiej",
       "Inflancka",
       "Parowozowa",
       "Małych Dębów",
       "Zajezdnia Ostrobramska",
       "EC Żerań",
       "Milenijna",
       "Świątynia Opatrzności Bożej",
   }
   ```
   
   For reference, the file relies heavily on RLE encoding (i.e. the RLE runs of parquet's RLE/bit-packing hybrid encoding), both for the validity (definition levels) and for the dictionary indices, so the RLE decoding path in the parquet crate would be the first place to look.
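   
   To make the distinction concrete, here is a minimal sketch of how the hybrid encoding is laid out according to the Parquet format specification. It is not the implementation in the parquet crate or in arrow2; `read_uleb128` and `decode_hybrid` are names made up for illustration, and the optional length prefix that precedes the runs in some pages is assumed to have been stripped already. Each run starts with a ULEB128 header: if its lowest bit is 0, it is an RLE run (`header >> 1` repetitions of one value stored in `ceil(bit_width / 8)` little-endian bytes); if it is 1, it is a bit-packed run (`header >> 1` groups of 8 values, packed from the least significant bit).
   
   ```rust
   /// Reads an unsigned LEB128 varint; returns (value, bytes consumed).
   /// Hypothetical helper, not part of any crate.
   fn read_uleb128(data: &[u8]) -> Option<(u64, usize)> {
       let mut value = 0u64;
       for (i, &byte) in data.iter().enumerate().take(10) {
           value |= u64::from(byte & 0x7f) << (7 * i);
           if byte & 0x80 == 0 {
               return Some((value, i + 1));
           }
       }
       None // empty or truncated input
   }

   /// Decodes `num_values` values from an RLE/bit-packing hybrid buffer.
   /// Assumes any length prefix was already removed. Returns `None` if the
   /// buffer ends before `num_values` values were produced.
   fn decode_hybrid(mut data: &[u8], bit_width: usize, num_values: usize) -> Option<Vec<u32>> {
       assert!(bit_width <= 32);
       let mut out = Vec::with_capacity(num_values);
       while out.len() < num_values {
           let (header, used) = read_uleb128(data)?;
           data = &data[used..];
           if header & 1 == 0 {
               // RLE run: one value repeated `header >> 1` times.
               let run_len = (header >> 1) as usize;
               let value_bytes = (bit_width + 7) / 8;
               let bytes = data.get(..value_bytes)?;
               data = &data[value_bytes..];
               let mut value = 0u32;
               for (i, &b) in bytes.iter().enumerate() {
                   value |= u32::from(b) << (8 * i);
               }
               let take = run_len.min(num_values - out.len());
               out.extend(std::iter::repeat(value).take(take));
           } else {
               // Bit-packed run: `header >> 1` groups of 8 values each.
               let groups = (header >> 1) as usize;
               let byte_len = groups * bit_width;
               let bytes = data.get(..byte_len)?;
               data = &data[byte_len..];
               for i in 0..groups * 8 {
                   if out.len() == num_values {
                       break; // trailing values in the last group are padding
                   }
                   let mut value = 0u32;
                   for bit in 0..bit_width {
                       let pos = i * bit_width + bit;
                       let b = (bytes[pos / 8] >> (pos % 8)) & 1;
                       value |= u32::from(b) << bit;
                   }
                   out.push(value);
               }
           }
       }
       Some(out)
   }
   ```
   
   For example, with a bit width of 3, the bytes `[0x08, 0x05]` are an RLE run of four 5s, and `[0x03, 0x88, 0xC6, 0xFA]` are a bit-packed group holding 0 through 7. A reader that handles the bit-packed branch correctly but mishandles the RLE branch would only go wrong on files like this one, where long RLE runs dominate.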

