Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/20 14:12:13 UTC

[GitHub] [arrow] eitsupi commented on issue #12469: [R] int32/int64 issues in opening CSVs

eitsupi commented on issue #12469:
URL: https://github.com/apache/arrow/issues/12469#issuecomment-1046246626


   The following code may be helpful.
   
   ```R
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   
   ds <- open_dataset("1960-1-01.csv", format = "csv")
   
   ds |>
     mutate(n = cast(n, int32())) |>
     write_dataset("tmp_dir", format = "parquet")
   
   open_dataset("tmp_dir")
   ```
   
   > I haven't been able to find any information on converting a `FileSystemDataset` to a `Table`. I have no idea where to go from here.
   
   `dplyr::compute()` can be used in the `dplyr` pipeline to materialize the query result as a `Table`.
   This matches the general usage of `dbplyr`, but it does indeed seem to be barely mentioned in the arrow documentation.
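   
   For example, a minimal sketch of that pattern (the tiny CSV written here is just illustrative, not the file from the original issue):
   
   ```r
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   
   # Write a small CSV so the example is self-contained.
   tf <- tempfile(fileext = ".csv")
   write.csv(data.frame(n = 1:3), tf, row.names = FALSE)
   
   ds <- open_dataset(tf, format = "csv")
   
   # compute() materializes the lazy dataset query as an in-memory arrow Table.
   tbl <- ds |>
     mutate(n = cast(n, int32())) |>
     compute()
   ```
   
   After `compute()`, `tbl` is an arrow `Table` rather than a lazy query, so it can be passed to anything expecting in-memory data.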
   
   By the way, I recently did the same thing with pyarrow: first I read each CSV file with its exact types and converted it to a Parquet file, then I read the collection of Parquet files and wrote them out again as a partitioned dataset.
   The Parquet-to-Parquet conversion was very easy, since reading and writing are much faster once the data is in Parquet.
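   
   A rough sketch of that pyarrow workflow (file names, column names, and the partitioning column are made up for illustration):
   
   ```python
   import os
   import tempfile
   
   import pyarrow as pa
   import pyarrow.csv as csv
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   tmp = tempfile.mkdtemp()
   
   # A small CSV stand-in for one of the source files.
   csv_path = os.path.join(tmp, "data.csv")
   with open(csv_path, "w") as f:
       f.write("year,n\n1960,1\n1961,2\n")
   
   # Step 1: read the CSV with explicit column types instead of relying
   # on type inference, then convert it to Parquet.
   table = csv.read_csv(
       csv_path,
       convert_options=csv.ConvertOptions(
           column_types={"year": pa.int32(), "n": pa.int32()}
       ),
   )
   parquet_path = os.path.join(tmp, "data.parquet")
   pq.write_table(table, parquet_path)
   
   # Step 2: read the Parquet file(s) back as a dataset and write them
   # out again as a partitioned dataset.
   dataset = ds.dataset(parquet_path, format="parquet")
   ds.write_dataset(
       dataset,
       os.path.join(tmp, "partitioned"),
       format="parquet",
       partitioning=ds.partitioning(
           pa.schema([("year", pa.int32())]), flavor="hive"
       ),
   )
   ```
   
   Because the types are pinned down at the CSV-reading step, the Parquet-to-Parquet rewrite needs no further casting.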


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org