Posted to dev@arrow.apache.org by "Zhuo Jia Dai (JIRA)" <ji...@apache.org> on 2019/08/14 00:07:00 UTC

[jira] [Created] (ARROW-6230) Reading in parquet files is 20x slower than reading fst files in R

Zhuo Jia Dai created ARROW-6230:
-----------------------------------

             Summary: Reading in parquet files is 20x slower than reading fst files in R
                 Key: ARROW-6230
                 URL: https://issues.apache.org/jira/browse/ARROW-6230
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
         Environment: Windows 10 Pro and Ubuntu 
            Reporter: Zhuo Jia Dai
             Fix For: 0.14.1
         Attachments: image-2019-08-14-10-04-56-834.png

*Problem*

Reading any of the data described below from a Parquet file with arrow is about 20x slower than reading the same data from an fst file in R.

 

*How to get the data*

[https://loanperformancedata.fanniemae.com/lppub/index.html]

Register and download any of the files. I can't redistribute the data myself, so it's best that you register and download it directly.

 

!image-2019-08-14-10-04-56-834.png!

 

*Code*

path = "data/Performance_2016Q4.txt"

library(data.table)
library(arrow)

# read the raw text file; header = FALSE because the first row is treated as data
a = data.table::fread(path, header = FALSE)

# write the same table out in both formats
fst::write_fst(a, "data/a.fst")
arrow::write_parquet(a, "data/a.parquet")

rm(a); gc()
# read test: fst
system.time(a <- fst::read_fst("data/a.fst"))
# 4.61 seconds

rm(a); gc()
# read test: parquet
system.time(a <- arrow::read_parquet("data/a.parquet"))
# 99.19 seconds
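
For reference, the slowdown implied by the two timings above can be checked with a few lines of base R. This is only a sketch of the arithmetic: the variable names are made up for illustration, and the numbers are the single-run timings reported above, not a repeated benchmark.

```r
# Single-run timings copied from the comments above (seconds)
fst_seconds <- 4.61
parquet_seconds <- 99.19

# How many times slower the parquet read was than the fst read
ratio <- parquet_seconds / fst_seconds
print(round(ratio, 1))
```

The ratio comes out at a little over 21, which is where the "20x" in the summary comes from.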



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)