Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2019/08/14 02:41:00 UTC

[jira] [Comment Edited] (ARROW-6230) [R] Reading Parquet files is 20x slower than reading fst files in R

    [ https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906812#comment-16906812 ] 

Wes McKinney edited comment on ARROW-6230 at 8/14/19 2:40 AM:
--------------------------------------------------------------

Thanks for the example. I'm interested to see where the time is being spent. Reading Parquet files is quite fast in Python, so I'll check the performance there as well.

There's some work going on for the current release (see ARROW-3772, ARROW-3325, ARROW-3246) that will enable reading and writing R factors directly from and to Parquet, so that could be a (no pun intended) factor in the results.
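
One way to see where the time goes is to time the Parquet decode and the conversion to an R data.frame as two separate steps. The following is only a minimal sketch under assumptions that are not part of the original comment: it assumes a recent arrow release in which read_parquet() accepts the as_data_frame argument, and it uses a small made-up data frame (the names demo, id, and grade are hypothetical) so that it runs without the Fannie Mae download.

```r
library(arrow)

# Hypothetical stand-in data with one factor column. To profile the real
# file, point parquet_path at "data/a.parquet" from the issue description.
demo <- data.frame(
  id = seq_len(1000L),
  grade = factor(sample(c("A", "B", "C"), 1000L, replace = TRUE))
)
parquet_path <- tempfile(fileext = ".parquet")
arrow::write_parquet(demo, parquet_path)

# Stage 1: decode the Parquet file into an Arrow Table only.
# Assumption: read_parquet() supports as_data_frame (recent arrow releases).
t_decode <- system.time(
  tbl <- arrow::read_parquet(parquet_path, as_data_frame = FALSE)
)

# Stage 2: convert the Arrow Table into an R data.frame.
t_convert <- system.time(
  df <- as.data.frame(tbl)
)

cat("decode elapsed (s): ", t_decode[["elapsed"]], "\n")
cat("convert elapsed (s):", t_convert[["elapsed"]], "\n")
```

If most of the elapsed time falls in the second stage, the Arrow-to-R conversion (for example of string or factor columns) would be the likelier bottleneck than Parquet decoding itself.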


> [R] Reading Parquet files is 20x slower than reading fst files in R
> -------------------------------------------------------------------
>
>                 Key: ARROW-6230
>                 URL: https://issues.apache.org/jira/browse/ARROW-6230
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>         Environment: Windows 10 Pro and Ubuntu 
>            Reporter: Zhuo Jia Dai
>            Priority: Major
>             Fix For: 0.14.1
>
>         Attachments: image-2019-08-14-10-04-56-834.png
>
>
> *Problem*
> Reading any of the datasets mentioned below from Parquet with arrow is about 20x slower than reading the same data from the fst format in R.
>  
> *How to get the data*
> [https://loanperformancedata.fanniemae.com/lppub/index.html]
> Register and download any of these files. I can't provide the data myself, so it's best to register and download it directly.
>  
> !image-2019-08-14-10-04-56-834.png!
>  
> *Code*
> path = "data/Performance_2016Q4.txt"
> library(data.table)
> library(arrow)
> a = data.table::fread(path, header = FALSE)
> fst::write_fst(a, "data/a.fst")
> arrow::write_parquet(a, "data/a.parquet")
> rm(a); gc()
> # time reading the fst file
> system.time(a <- fst::read_fst("data/a.fst"))
> # 4.61 seconds
> rm(a); gc()
> # time reading the Parquet file
> system.time(a <- arrow::read_parquet("data/a.parquet"))
> # 99.19 seconds


