You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Guillermo Duran (Jira)" <ji...@apache.org> on 2022/10/14 14:20:00 UTC

[jira] [Comment Edited] (ARROW-17432) [R] messed up rows when importing large csv into parquet

    [ https://issues.apache.org/jira/browse/ARROW-17432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617742#comment-17617742 ] 

Guillermo Duran edited comment on ARROW-17432 at 10/14/22 2:19 PM:
-------------------------------------------------------------------

Thanks for your suggestion [~fmic] 

As why we have a decimal notation on the checklists ID column... that's a good question, but basically that's how the data is exported from the eBird db.

 
{code:java}
$ head obs.csv
checklist_id,species_code,exotic_category,obs_count,only_presence_reported,only_slash_reported,valid,reviewed,has_media
771514.0,comgra,,1.0,0.0,0.0,1.0,0.0,0.0
771514.0,grycat,,1.0,0.0,0.0,1.0,0.0,0.0
771514.0,bkcchi,,1.0,0.0,0.0,1.0,0.0,0.0
771514.0,sonspa,,3.0,0.0,0.0,1.0,0.0,0.0
771514.0,bnhcow,,1.0,0.0,0.0,1.0,0.0,0.0
771514.0,amegfi,,4.0,0.0,0.0,1.0,0.0,0.0
771514.0,rewbla,,1.0,0.0,0.0,1.0,0.0,0.0
771515.0,yelwar,,4.0,0.0,0.0,1.0,0.0,0.0
771515.0,amecro,,1.0,0.0,0.0,1.0,0.0,0.0
{code}
I can do some regex trickery to remove the ".0" from the csv before _open_dataset()_ but in any case it's weird that I have the issue only when importing the csv with the R Arrow library. If I change the type for that column to int32 on the _open_dataset()_ schema,  I get a conversion error due to the mismatch.

 

If I do the subset with the decimal I still get the same weird rows:

 
{code:java}
sch = arrow::schema(checklist_id = float32(),
                    species_code = string(),
                    exotic_category = float32(),
                    obs_count = float32(),
                    only_presence_reported = float32(),
                    only_slash_reported = float32(),
                    valid = float32(),
                    reviewed = float32(),
                    has_media = float32()
                    )

csv_stream <- open_dataset(csv_file, format = "csv", 
                           col_types = sch)

csv_stream |> 
   filter(checklist_id == 18543372.0) |> 
   arrange(species_code) |> 
   collect()

# A tibble: 50 × 9
   checklist_id species_code exotic_category obs_count only_presence_reported only_slash_reported valid reviewed has_media
          <dbl> <chr>                  <dbl>     <dbl>                  <dbl>               <dbl> <dbl>    <dbl>     <dbl>
 1     18543372 altori                    NA         3                      0                   0     1        0         0
 2     18543372 amekes                    NA         1                      0                   0     1        0         0
 3     18543372 amered                    NA        40                      0                   0     1        1         0
 4     18543372 amerob                    NA        30                      0                   0     1        0         0
 5     18543372 amerob                    NA         9                      0                   0     1        0         0
 6     18543372 balori                    NA         9                      0                   0     1        0         0
 7     18543372 blkter                    NA         9                      0                   0     1        0         0
 8     18543372 blkvul                    NA        20                      0                   0     1        0         0
 9     18543372 buggna                    NA         1                      0                   0     1        0         0
10     18543372 buwwar                    NA         1                      0                   0     1        0         0
# … with 40 more rows
# ℹ Use `print(n = ...)` to see more rows{code}
 

Even if I filter by an interval, the result is the same:
{code:java}
csv_stream |> 
  filter(checklist_id > 18543371 & checklist_id < 18543373) |> 
  arrange(species_code) |> 
  collect()

# A tibble: 50 × 9
   checklist_id species_code exotic_category obs_count only_presence_reported only_slash_reported valid reviewed has_media
          <dbl> <chr>                  <dbl>     <dbl>                  <dbl>               <dbl> <dbl>    <dbl>     <dbl>
 1     18543372 altori                    NA         3                      0                   0     1        0         0
 2     18543372 amekes                    NA         1                      0                   0     1        0         0
 3     18543372 amered                    NA        40                      0                   0     1        1         0
 4     18543372 amerob                    NA        30                      0                   0     1        0         0
 5     18543372 amerob                    NA         9                      0                   0     1        0         0
 6     18543372 balori                    NA         9                      0                   0     1        0         0
 7     18543372 blkter                    NA         9                      0                   0     1        0         0
 8     18543372 blkvul                    NA        20                      0                   0     1        0         0
 9     18543372 buggna                    NA         1                      0                   0     1        0         0
10     18543372 buwwar                    NA         1                      0                   0     1        0         0
# … with 40 more rows
# ℹ Use `print(n = ...)` to see more rows {code}
 

Also, thinking about the possible differences between the schema used to import the csv in Arrow's Python vs R, it seems that both are the "same":
{code:java}
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.compute as pc
import pandas as pd 

test_rows_csv = pd.read_csv("/ebird_erd2021/full/obs.csv",
                            nrows = 1000)

sch = pa.Schema.from_pandas(test_rows_csv)

sch

checklist_id: double
species_code: string
exotic_category: double
obs_count: double
only_presence_reported: double
only_slash_reported: double
valid: double
reviewed: double
has_media: double
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 1391{code}


was (Author: JIRAUSER294488):
Thanks for your suggestion [~fmic] 

As why we have a decimal notation on the checklists ID column... that's a good question, but basically that's how the data is exported from the eBird db.

 
{code:java}
$ head obs.csv
checklist_id,species_code,exotic_category,obs_count,only_presence_reported,only_slash_reported,valid,reviewed,has_media
771514.0,comgra,,1.0,0.0,0.0,1.0,0.0,0.0
771514.0,grycat,,1.0,0.0,0.0,1.0,0.0,0.0
771514.0,bkcchi,,1.0,0.0,0.0,1.0,0.0,0.0
771514.0,sonspa,,3.0,0.0,0.0,1.0,0.0,0.0
771514.0,bnhcow,,1.0,0.0,0.0,1.0,0.0,0.0
771514.0,amegfi,,4.0,0.0,0.0,1.0,0.0,0.0
771514.0,rewbla,,1.0,0.0,0.0,1.0,0.0,0.0
771515.0,yelwar,,4.0,0.0,0.0,1.0,0.0,0.0
771515.0,amecro,,1.0,0.0,0.0,1.0,0.0,0.0
{code}
I can do some regex trickery to remove the ".0" from the csv before _open_dataset()_ but in any case it's weird that I have the issue only when importing the csv with the R Arrow library. If I change the type for that column to int32 on the _open_dataset()_ schema,  I get a conversion error due to the mismatch.

 

If I do the subset with the decimal I still get the same weird rows:

 
{code:java}
sch = arrow::schema(checklist_id = float32(),
                    species_code = string(),
                    exotic_category = float32(),
                    obs_count = float32(),
                    only_presence_reported = float32(),
                    only_slash_reported = float32(),
                    valid = float32(),
                    reviewed = float32(),
                    has_media = float32()
                    )

csv_stream <- open_dataset(csv_file, format = "csv", 
                           col_types = sch)

csv_stream |> 
   filter(checklist_id == 18543372.0) |> 
   arrange(species_code) |> 
   collect()

# A tibble: 50 × 9
   checklist_id species_code exotic_category obs_count only_presence_reported only_slash_reported valid reviewed has_media
          <dbl> <chr>                  <dbl>     <dbl>                  <dbl>               <dbl> <dbl>    <dbl>     <dbl>
 1     18543372 altori                    NA         3                      0                   0     1        0         0
 2     18543372 amekes                    NA         1                      0                   0     1        0         0
 3     18543372 amered                    NA        40                      0                   0     1        1         0
 4     18543372 amerob                    NA        30                      0                   0     1        0         0
 5     18543372 amerob                    NA         9                      0                   0     1        0         0
 6     18543372 balori                    NA         9                      0                   0     1        0         0
 7     18543372 blkter                    NA         9                      0                   0     1        0         0
 8     18543372 blkvul                    NA        20                      0                   0     1        0         0
 9     18543372 buggna                    NA         1                      0                   0     1        0         0
10     18543372 buwwar                    NA         1                      0                   0     1        0         0
# … with 40 more rows
# ℹ Use `print(n = ...)` to see more rows{code}
 

Even if I filter by an interval, the result is the same:
{code:java}
csv_stream |> 
  filter(checklist_id > 18543371 & checklist_id < 18543373) |> 
  arrange(species_code) |> 
  collect()

# A tibble: 50 × 9
   checklist_id species_code exotic_category obs_count only_presence_reported only_slash_reported valid reviewed has_media
          <dbl> <chr>                  <dbl>     <dbl>                  <dbl>               <dbl> <dbl>    <dbl>     <dbl>
 1     18543372 altori                    NA         3                      0                   0     1        0         0
 2     18543372 amekes                    NA         1                      0                   0     1        0         0
 3     18543372 amered                    NA        40                      0                   0     1        1         0
 4     18543372 amerob                    NA        30                      0                   0     1        0         0
 5     18543372 amerob                    NA         9                      0                   0     1        0         0
 6     18543372 balori                    NA         9                      0                   0     1        0         0
 7     18543372 blkter                    NA         9                      0                   0     1        0         0
 8     18543372 blkvul                    NA        20                      0                   0     1        0         0
 9     18543372 buggna                    NA         1                      0                   0     1        0         0
10     18543372 buwwar                    NA         1                      0                   0     1        0         0
# … with 40 more rows
# ℹ Use `print(n = ...)` to see more rows {code}
 

> [R] messed up rows when importing large csv into parquet
> --------------------------------------------------------
>
>                 Key: ARROW-17432
>                 URL: https://issues.apache.org/jira/browse/ARROW-17432
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 8.0.0, 9.0.0
>         Environment: R version 4.2.1
> Running in Arch Linux - EndeavourOS
> arrow_info()
> Arrow package version: 9.0.0
> Capabilities:
>                
> dataset    TRUE
> substrait FALSE
> parquet    TRUE
> json       TRUE
> s3         TRUE
> gcs        TRUE
> utf8proc   TRUE
> re2        TRUE
> snappy     TRUE
> gzip       TRUE
> brotli     TRUE
> zstd       TRUE
> lz4        TRUE
> lz4_frame  TRUE
> lzo       FALSE
> bz2        TRUE
> jemalloc   TRUE
> mimalloc   TRUE
> Memory:
>                   
> Allocator jemalloc
> Current   49.31 Kb
> Max        1.63 Mb
> Runtime:
>                         
> SIMD Level          avx2
> Detected SIMD Level avx2
> Build:
>                           
> C++ Library Version  9.0.0
> C++ Compiler           GNU
> C++ Compiler Version 7.5.0
> ####
> print(pa.__version__)
> 9.0.0
>            Reporter: Guillermo Duran
>            Priority: Major
>
> This is a weird issue that creates new rows when importing a large csv (56 GB) into parquet in R. It occurred with both R Arrow 8.0.0 and 9.0.0 BUT didn't occur with the Python Arrow library 9.0.0. Due to the large size of the original csv it's difficult to create a reproducible example, but I share the code and outputs.
> The code I use in R to import the csv:
> {code:java}
> library(arrow)
> library(dplyr)
>  
> csv_file <- "/ebird_erd2021/full/obs.csv"
> dest <- "/ebird_erd2021/full/obs_parquet/" 
> sch = arrow::schema(checklist_id = float32(),
>                     species_code = string(),
>                     exotic_category = float32(),
>                     obs_count = float32(),
>                     only_presence_reported = float32(),
>                     only_slash_reported = float32(),
>                     valid = float32(),
>                     reviewed = float32(),
>                     has_media = float32()
>                     )
> csv_stream <- open_dataset(csv_file, format = "csv", 
>                            schema = sch, skip_rows = 1)
> write_dataset(csv_stream, dest, format = "parquet", 
>               max_rows_per_file=1000000L,
>               hive_style = TRUE,
>               existing_data_behavior = "overwrite"){code}
> When I load the dataset and check one random _checklist_id_ I get rows that are not part of the _obs.csv_ file. There shouldn't be duplicated species in a checklist but there are ({_}amerob{_} for example)...  also note that the duplicated species have different {_}obs_count{_}. 50 species in total in that specific {_}checklist_id{_}.
> {code:java}
> parquet_arrow <- open_dataset(dest, format = "parquet")
> parquet_arrow |> 
>   filter(checklist_id == 18543372) |> 
>   arrange(species_code) |> 
>   collect() 
> # A tibble: 50 × 3
>    checklist_id species_code obs_count
>           <dbl> <chr>            <dbl>
>  1     18543372 altori               3
>  2     18543372 amekes               1
>  3     18543372 amered              40
>  4     18543372 amerob              30
>  5     18543372 amerob               9
>  6     18543372 balori               9
>  7     18543372 blkter               9
>  8     18543372 blkvul              20
>  9     18543372 buggna               1
> 10     18543372 buwwar               1
> # … with 40 more rows
> # ℹ Use `print(n = ...)` to see more rows{code}
> If I use awk to query the csv file with that same checklist id, I get something different:
> {code:java}
> $ awk -F "," '{ if ($1 == 18543372) { print } }' obs.csv
> 18543372.0,rewbla,,60.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,amerob,,30.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,robgro,,2.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,eastow,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,sedwre1,,2.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,ovenbi1,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,buggna,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,reshaw,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,turvul,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,gowwar,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,balori,,9.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,buwwar,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,grycat,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,cangoo,,6.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,houwre,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,amered,,40.0,0.0,0.0,1.0,1.0,0.0
> 18543372.0,norwat,,2.0,0.0,0.0,1.0,0.0,0.0{code}
> 17 different species and no repetitions... Look _amerob_ on the 2nd line only, with 30 _obs_count_
>  
> If I import the csv into parquet using the Python Arrow library as:
> {code:java}
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.compute as pc
> import pandas as pd 
> test_rows_csv = pd.read_csv("/ebird_erd2021/full/obs.csv",
>                             nrows = 1000)
> sch = pa.Schema.from_pandas(test_rows_csv)
> csv_file = ds.dataset("/ebird_erd2021/full/obs.csv",
>                       schema = sch,
>                       format = "csv")
> ds.write_dataset(csv_file, 
>                  "ebird_erd2021/full/obs_parquet_py/",
>                  format = "parquet",
>                  schema = sch,
>                  use_threads = True,
>                  max_rows_per_file = 1000000,
>                  max_rows_per_group = 1000000,
>                  existing_data_behavior = "error"){code}
> And then load it in R doing the same checklist search:
> {code:java}
> parquet_py <- "/ebird_erd2021/full/obs_parquet_py/"
> parquet_arrow <- open_dataset(parquet_py, format = "parquet")
> parquet_arrow |> 
>   filter(checklist_id == 18543372) |> 
>   arrange(species_code) |> 
>   select(checklist_id, species_code, obs_count) |> 
>   collect()
> # A tibble: 17 × 3
>    checklist_id species_code obs_count
>           <dbl> <chr>            <dbl>
>  1     18543372 amered              40
>  2     18543372 amerob              30
>  3     18543372 balori               9
>  4     18543372 buggna               1
>  5     18543372 buwwar               1
>  6     18543372 cangoo               6
>  7     18543372 eastow               1
>  8     18543372 gowwar               1
>  9     18543372 grycat               1
> 10     18543372 houwre               1
> 11     18543372 norwat               2
> 12     18543372 ovenbi1              1
> 13     18543372 reshaw               1
> 14     18543372 rewbla              60
> 15     18543372 robgro               2
> 16     18543372 sedwre1              2
> 17     18543372 turvul               1{code}
> I get exactly what I should. No _species_code_ repeated and 17 different species.
>  
> Due to these differences I guess something weird must be happening in the R arrow library.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)