Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/20 14:15:13 UTC

[GitHub] [arrow] eitsupi edited a comment on issue #12469: [R] int32/int64 issues in opening CSVs

eitsupi edited a comment on issue #12469:
URL: https://github.com/apache/arrow/issues/12469#issuecomment-1046246626


   The following code may be helpful.
   
   ```R
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   
   ds <- open_dataset("1960-1-01.csv", format = "csv")
   
   ds |>
     mutate(n = cast(n, int32())) |>
     write_dataset("tmp_dir", format = "parquet")
   
   open_dataset("tmp_dir")
   ```
   
   I don't think I would have noticed the existence of `cast()` if I hadn't used pyarrow: it doesn't exist as a stand-alone R function, and the fact that it can be used inside dplyr verbs is currently barely mentioned in the help pages.
   It's definitely worth improving the documentation.
   
   > I haven't been able to find any information on converting a `FileSystemDataset` to a `Table`. I have no idea where to go from here.
   
   `dplyr::compute()` can be used in the `dplyr` pipeline to materialize the query result as a `Table`.
   This is in line with the general usage of `dbplyr`, but it certainly seems to be barely mentioned in the arrow documentation.
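   For example, a minimal sketch (assuming the same `ds` dataset as above):
   
   ```R
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   
   ds <- open_dataset("1960-1-01.csv", format = "csv")
   
   # compute() materializes the lazy query as an in-memory arrow Table
   tbl <- ds |>
     mutate(n = cast(n, int32())) |>
     compute()
   
   class(tbl)
   ```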
   
   By the way, I recently did the same thing with pyarrow: first I read each CSV file with its exact types and converted it to a Parquet file, then I read the collection of Parquet files and wrote them out again as a partitioned dataset.
   The Parquet-to-Parquet conversion was very easy, since reading and writing are much faster once the data is in Parquet.
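   In R, that two-step workflow might look roughly like this (a sketch only; the directory names, the schema, and the `year` partitioning column are made-up examples):
   
   ```R
   library(arrow, warn.conflicts = FALSE)
   
   # Step 1: read each CSV with an explicit schema and write it out as Parquet.
   # `my_schema` and the paths are hypothetical; skip = 1 skips the header row,
   # since supplying a schema means the column names come from the schema.
   my_schema <- schema(n = int32(), year = int32())
   for (f in list.files("csv_dir", full.names = TRUE)) {
     read_csv_arrow(f, schema = my_schema, skip = 1) |>
       write_parquet(file.path("parquet_dir", paste0(basename(f), ".parquet")))
   }
   
   # Step 2: read the Parquet files as one dataset and rewrite them partitioned.
   open_dataset("parquet_dir") |>
     write_dataset("partitioned_dir", format = "parquet", partitioning = "year")
   ```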


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org