Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/09 14:07:18 UTC
[GitHub] [arrow] OfekShilon opened a new issue, #15271: [R] Arrow saving and loading much slower than native R for small data frames
OfekShilon opened a new issue, #15271:
URL: https://github.com/apache/arrow/issues/15271
### Describe the bug, including details regarding any error messages, version, and platform.
Test script that measures R/arrow load time for various sizes:
```r
colnums <- c(10,20,30,100,150,200,300,500)
rownums <- c(1,2,3,4,5,10,20,30,40,50,60,70,100,200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000, 10000)
# Generate files
for (colnum in colnums) {
for (rownum in rownums) {
fn.robj <- paste0("~/tmp/robj.",rownum,"x",colnum)
fn.arrow <- paste0("~/tmp/arrow.",rownum,"x",colnum)
dat <- as.data.frame(matrix(runif(rownum*colnum), nrow=rownum, ncol=colnum))
save(dat, file=fn.robj)
arrow::write_feather(x = dat, sink = fn.arrow)
}
}
times.robj <- matrix(0, nrow=length(rownums), ncol=length(colnums))
rownames(times.robj) <- paste(rownums,"rows")
colnames(times.robj) <- paste(colnums,"cols")
times.arrow <- times.robj
for (i in 1:length(rownums)) {
for (j in 1:length(colnums)) {
rownum <- rownums[i]
colnum <- colnums[j]
fn.robj <- paste0("~/tmp/robj.",rownum,"x",colnum)
fn.arrow <- paste0("~/tmp/arrow.",rownum,"x",colnum)
# measure 2nd load to account for cold caches
load(fn.robj)
start <- Sys.time();
load(fn.robj);
times.robj[i,j] <- Sys.time()-start
tst <- arrow::read_feather(fn.arrow)
start <- Sys.time();
tst <- arrow::read_feather(fn.arrow);
times.arrow[i,j] <- Sys.time()-start
}
}
```
Results:
```
> times.arrow / times.robj
10 cols 20 cols 30 cols 100 cols 150 cols 200 cols 300 cols 500 cols
1 rows 16.1439951 19.7020075 25.1108247 51.1643757 77.1529228 91.3080397 111.3643533 149.3513743
2 rows 15.0277094 21.2175810 22.2626322 48.8661710 68.6573327 650.6486486 134.8991050 130.5041691
3 rows 14.6777409 20.1436969 20.9700806 47.7467603 63.9312016 68.5315315 98.5874855 119.4731097
4 rows 13.2236921 17.4342891 20.9966044 43.8189867 57.1619048 64.3601299 94.4213217 118.8271915
5 rows 12.6945607 14.8067084 18.7377778 36.4182165 49.6366695 56.7033511 73.2449044 115.0325528
10 rows 13.1203008 16.9616537 16.7252696 37.5056129 47.2363992 56.1606467 76.4436374 86.6117791
20 rows 12.4548896 774.0376940 17.5051370 32.4073774 35.6958398 39.4063311 46.5070936 51.8869215
30 rows 10.2758259 12.8381764 15.6813459 25.9489239 30.6835476 31.7596519 35.4976311 41.5393059
40 rows 10.8671210 7.8244697 15.1399804 23.4805764 29.2812743 26.6662289 31.4367649 42.6152522
50 rows 11.3902007 12.6833417 15.2992519 25.2068532 27.2051708 28.9717248 32.0606809 36.8470872
60 rows 10.9138495 14.1022129 16.6385948 22.7227723 26.6038445 27.9418484 28.5083841 33.9032176
70 rows 10.7040650 12.1799904 13.2777314 19.7737738 20.8106306 21.8470504 22.5418507 27.6593520
100 rows 10.7567132 11.7838963 12.8056854 15.0082676 28.4549343 18.1499451 21.5192503 22.0708589
200 rows 9.5018797 10.1656687 10.6434257 12.3456125 12.0490603 12.5274870 13.1872241 14.6434862
300 rows 9.6111111 8.9652621 8.9622146 9.3272070 9.1396644 10.0647620 10.6045769 12.0662228
400 rows 8.7160494 9.3873540 8.3236041 7.2730971 7.9281412 7.4078140 7.4032556 7.9848605
500 rows 7.1358811 6.4100263 6.4007276 6.0777437 6.6235458 6.2249675 6.3370181 6.9172020
1000 rows 5.3677043 4.4564087 4.1116463 3.6105644 3.2333922 3.2778293 3.2759320 3.4308380
2000 rows 3.5031858 2.5319266 2.4289314 1.8577107 1.7995663 1.7371557 1.7497375 1.8541778
3000 rows 2.5769010 6.3183501 1.7323371 1.3046406 1.2342389 1.2235438 1.3174136 1.2508460
4000 rows 2.0956563 1.4165296 1.8561829 0.9478190 0.8863266 1.2302510 0.8732958 0.8928616
5000 rows 1.6759777 1.2119986 1.1039393 0.8229102 1.3977869 0.9786898 0.9761781 0.8342817
10000 rows 0.9136646 0.6621193 0.5184357 0.4271505 0.3822572 0.3574329 0.3735044 0.4495687
```
Is this some known overhead? It seems rather large...
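A note on the measurement itself: each cell above comes from a single timed read, which may explain isolated outliers such as the 650x and 774x entries. The following is a minimal sketch, assuming that repeating each read and taking the median is acceptable for this benchmark. `median_secs` is a hypothetical helper, not part of the original script, and the demonstration uses only base R storage so that it runs without arrow.

```r
# Hypothetical helper, not part of the original script: run `f` several
# times and return the median elapsed time in seconds.
median_secs <- function(f, reps = 11L) {
  elapsed <- vapply(seq_len(reps), function(k) {
    start <- Sys.time()
    f()
    # Force the unit to seconds so values are comparable across cells.
    as.numeric(difftime(Sys.time(), start, units = "secs"))
  }, numeric(1))
  median(elapsed)
}

# Small demonstration using only base R storage.
fn <- tempfile(fileext = ".rda")
dat <- as.data.frame(matrix(runif(5 * 10), nrow = 5, ncol = 10))
save(dat, file = fn)

load(fn)  # warm-up read, as in the original script
t_robj <- median_secs(function() load(fn))
```

The same helper could presumably wrap the arrow read, for example `median_secs(function() arrow::read_feather(fn.arrow))`, so that both sides of the ratio are measured the same way.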
### Component(s)
R
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] paleolimbot commented on issue #15271: [R] Arrow file load much slower than native R for small data frames
Posted by GitBox <gi...@apache.org>.
paleolimbot commented on issue #15271:
URL: https://github.com/apache/arrow/issues/15271#issuecomment-1377370938
Thanks for opening this!
I'm not surprised that very small objects have this property. Arrow's columnar format exploits the fact that there are frequently many more rows than columns, and there are some places in the R package where we loop over columns in R. Mostly that is fine, although looping in R over 500 columns, as you've seen, can result in some overhead. If you look at the absolute times (instead of the relative times), I imagine that what you're seeing is still a very small overhead (maybe 0.1s); R can just do that much faster.
Is there a workflow where you're seeing this impact analysis time?
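To make the point about absolute versus relative times concrete, here is a small sketch of the arithmetic. `overhead_ms` is a hypothetical helper, and the two timings are placeholder values chosen only for illustration; they are not measurements from this issue.

```r
# Hypothetical helper: absolute overhead in milliseconds, given two
# timings (or two matrices of timings) in seconds.
overhead_ms <- function(slow_secs, fast_secs) {
  (slow_secs - fast_secs) * 1000
}

# Placeholder numbers chosen only to show the arithmetic; they are NOT
# measurements from this issue.
fast <- 0.0001      # a 0.1 ms baseline read
slow <- 20 * fast   # a 20x relative slowdown
overhead_ms(slow, fast)
```

Because the subtraction is elementwise, the helper should also work on the matrices from the original script, as in `overhead_ms(times.arrow, times.robj)`.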
[GitHub] [arrow] OfekShilon commented on issue #15271: [R] Arrow file load much slower than native R for small data frames
Posted by GitBox <gi...@apache.org>.
OfekShilon commented on issue #15271:
URL: https://github.com/apache/arrow/issues/15271#issuecomment-1377397602
> Is there a workflow where you're seeing this impact analysis time?
While transitioning from native-R storage to arrow, we saw substantial performance degradation in a workflow that generates many (10K+) small files in a cluster and then reads them back.
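The workflow described here can be approximated with a rough sketch that times the whole readback of a batch of small files, rather than a single read. Everything below is an assumption made for illustration: `time_readback` is a hypothetical helper, the file count is far smaller than the real 10K+, and `saveRDS`/`readRDS` stand in for the `save`/`load` calls of the original script so that the example runs with base R alone.

```r
# Hypothetical benchmark helper: total seconds needed to read every
# file in `paths` with the supplied `reader` function.
time_readback <- function(paths, reader) {
  start <- Sys.time()
  for (p in paths) reader(p)
  as.numeric(difftime(Sys.time(), start, units = "secs"))
}

# Write a batch of small data frames (5 rows x 10 columns) to a temp dir.
n_files <- 100L   # assumed value; the workflow described above uses 10K+
tmpdir <- tempfile()
dir.create(tmpdir)
paths <- file.path(tmpdir, paste0("part.", seq_len(n_files), ".rds"))
for (p in paths) {
  dat <- as.data.frame(matrix(runif(5 * 10), nrow = 5, ncol = 10))
  saveRDS(dat, p)
}

total_rds <- time_readback(paths, readRDS)
```

Passing a different `reader`, such as `arrow::read_feather` over a matching set of Feather files, would give a total that can be compared directly against `total_rds`.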
[GitHub] [arrow] paleolimbot commented on issue #15271: [R] Arrow file load much slower than native R for small data frames
Posted by GitBox <gi...@apache.org>.
paleolimbot commented on issue #15271:
URL: https://github.com/apache/arrow/issues/15271#issuecomment-1377666643
Ok! It seems like the problem is metadata. On write, we stick some R metadata into the schema and use it to do some stuff when we recreate the data frame on the way out. Most of the time that metadata is unused, and because it involves an R loop we see performance issues.
If you write the file without the R metadata, it looks like reading it is much faster (but definitely test locally to confirm!).
If this works for you, we could add a flag to disable writing the R metadata (or disable loading it).
``` r
tmpdir <- tempfile()
dir.create(tmpdir)
colnums <- c(10,20,30,100,150,200,300,500)
rownums <- c(1,2,3,4,5,10,20,30,40,50,60,70,100,200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000, 10000)
# Generate files
for (colnum in colnums) {
for (rownum in rownums) {
fn.robj <- paste0(tmpdir, "/robj.",rownum,"x",colnum)
fn.arrow <- paste0(tmpdir, "/arrow.",rownum,"x",colnum)
dat <- as.data.frame(matrix(runif(rownum*colnum), nrow=rownum, ncol=colnum))
save(dat, file=fn.robj)
# create the table manually to avoid metadata
dat_table <- arrow::as_arrow_table(dat)
schema <- dat_table$schema
schema$metadata <- NULL
dat_table <- dat_table$cast(schema)
arrow::write_feather(x = dat_table, sink = fn.arrow, compression = "uncompressed")
}
}
times.robj <- matrix(0, nrow=length(rownums), ncol=length(colnums))
rownames(times.robj) <- paste(rownums,"rows")
colnames(times.robj) <- paste(colnums,"cols")
times.arrow <- times.robj
for (i in 1:length(rownums)) {
for (j in 1:length(colnums)) {
rownum <- rownums[i]
colnum <- colnums[j]
fn.robj <- paste0(tmpdir, "/robj.",rownum,"x",colnum)
fn.arrow <- paste0(tmpdir, "/arrow.",rownum,"x",colnum)
# measure 2nd load to account for cold caches
load(fn.robj)
start <- Sys.time();
load(fn.robj);
times.robj[i,j] <- Sys.time()-start
tst <- arrow::read_feather(fn.arrow)
start <- Sys.time();
tst <- arrow::read_feather(fn.arrow, as_data_frame = TRUE, mmap = TRUE);
times.arrow[i,j] <- Sys.time()-start
}
}
times.arrow / times.robj
#> 10 cols 20 cols 30 cols 100 cols 150 cols 200 cols
#> 1 rows 14.0952381 12.2730769 14.97500000 9.20437956 8.75479744 8.13718412
#> 2 rows 16.0696517 14.9234234 14.28278689 9.90533981 8.56250000 8.56160000
#> 3 rows 13.7713004 13.4891775 11.53790614 8.57407407 8.35842294 7.07703704
#> 4 rows 14.7380952 95.9319149 11.25517241 8.21645022 6.92554992 6.31000000
#> 5 rows 14.4626168 12.9609375 11.72664360 7.66060606 6.76986755 5.90463576
#> 10 rows 12.3790323 10.7172414 9.49712644 6.16776316 5.58681876 4.66462793
#> 20 rows 11.2867647 62.1293103 7.77804296 4.57604790 3.61700263 3.06232877
#> 30 rows 10.3590604 8.4000000 7.05376344 3.62488129 2.55404571 11.87710970
#> 40 rows 9.9206349 7.3310185 6.19379845 3.17002417 2.39525463 1.98561465
#> 50 rows 11.6686567 6.9299781 5.71708185 2.71903751 2.14587738 1.81172220
#> 60 rows 8.8262032 6.7301255 5.47731092 16.19293478 1.97486961 1.62612613
#> 70 rows 8.8347339 6.4109312 5.19554849 2.44055069 1.78809932 1.56611431
#> 100 rows 7.7412935 5.3079526 4.49799197 2.03780242 1.44817927 1.29230357
#> 200 rows 6.7373358 3.8204819 3.00714286 1.16359795 0.87507926 0.72829531
#> 300 rows 4.9736842 2.9963603 2.74172185 0.85562541 0.63074822 0.51833064
#> 400 rows 3.9795134 2.4449307 1.77052632 0.82852432 0.54286035 0.40358784
#> 500 rows 3.4116356 2.0236613 1.46481876 0.55421516 0.40433317 1.58486533
#> 1000 rows 1.9754717 1.1283404 0.85135779 0.28743853 0.21571464 0.17574634
#> 2000 rows 1.1457113 0.5982890 0.41544440 0.15338311 0.10889462 0.09171419
#> 3000 rows 0.7994512 0.4206546 0.28310156 0.10914713 0.07384605 0.06096175
#> 4000 rows 0.6511236 0.3175360 0.23418670 0.07628486 0.05940748 0.04918424
#> 5000 rows 0.4762331 0.2692693 0.17628306 0.07026943 0.04680908 0.03953978
#> 10000 rows 0.2431953 0.1263146 0.08880676 0.03294864 0.02410858 0.02036180
#> 300 cols 500 cols
#> 1 rows 7.25616438 6.53256705
#> 2 rows 6.45868263 5.33515199
#> 3 rows 6.25084364 5.28482972
#> 4 rows 5.76898396 4.93076374
#> 5 rows 5.70020121 4.40679095
#> 10 rows 3.84901532 3.18668529
#> 20 rows 2.55096154 2.00625000
#> 30 rows 1.87665830 1.49675397
#> 40 rows 1.68254466 1.29062263
#> 50 rows 1.72490914 1.25016578
#> 60 rows 1.38690327 1.04475309
#> 70 rows 1.43247588 0.94965370
#> 100 rows 0.96589063 0.79135421
#> 200 rows 0.58168371 0.45303118
#> 300 rows 0.42742552 0.32061561
#> 400 rows 0.31737134 0.24586616
#> 500 rows 0.25752199 0.20255645
#> 1000 rows 0.55231620 0.10501935
#> 2000 rows 0.07068172 0.05902058
#> 3000 rows 0.04852037 0.13285303
#> 4000 rows 0.03674992 0.02832831
#> 5000 rows 0.01554990 0.02325099
#> 10000 rows 0.01585784 0.01236076
```
<sup>Created on 2023-01-10 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
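For convenience, the metadata-stripping steps from the reprex above could be packaged into a single function. This is only a sketch: `write_feather_nometa` is a hypothetical name, not an existing arrow function, and it simply repeats the calls already shown (`as_arrow_table`, clearing `schema$metadata`, `cast`, `write_feather`). The demonstration is guarded so that it only runs when the arrow package is installed.

```r
# Hypothetical wrapper around the steps shown in the reprex above;
# this is NOT an existing function in the arrow package.
write_feather_nometa <- function(df, path) {
  tbl <- arrow::as_arrow_table(df)
  schema <- tbl$schema
  schema$metadata <- NULL   # drop the R metadata from the schema
  tbl <- tbl$cast(schema)
  arrow::write_feather(tbl, sink = path, compression = "uncompressed")
  invisible(path)
}

# Only runs when the arrow package is installed.
if (requireNamespace("arrow", quietly = TRUE)) {
  fn <- tempfile(fileext = ".arrow")
  dat <- as.data.frame(matrix(runif(5 * 10), nrow = 5, ncol = 10))
  write_feather_nometa(dat, fn)
  back <- arrow::read_feather(fn)
  print(dim(back))
}
```

Since the R metadata is what restores R-specific attributes on read, a wrapper like this is probably only appropriate for plain data frames such as the ones in this benchmark.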
[GitHub] [arrow] paleolimbot commented on issue #15271: [R] Arrow file load much slower than native R for small data frames
Posted by GitBox <gi...@apache.org>.
paleolimbot commented on issue #15271:
URL: https://github.com/apache/arrow/issues/15271#issuecomment-1377495190
Make sure you're writing using `compression = "uncompressed"`! It's not perfect, but it is about 2x faster. I'll look into it to see if there's any way to skip some R code here and call the C++ writer more directly...even the level of overhead with no compression that you've highlighted is confusing to me.
Using no compression:
``` r
tmpdir <- tempfile()
dir.create(tmpdir)
colnums <- c(10,20,30,100,150,200,300,500)
rownums <- c(1,2,3,4,5,10,20,30,40,50,60,70,100,200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000, 10000)
# Generate files
for (colnum in colnums) {
for (rownum in rownums) {
fn.robj <- paste0(tmpdir, "/robj.",rownum,"x",colnum)
fn.arrow <- paste0(tmpdir, "/arrow.",rownum,"x",colnum)
dat <- as.data.frame(matrix(runif(rownum*colnum), nrow=rownum, ncol=colnum))
save(dat, file=fn.robj)
arrow::write_feather(x = dat, sink = fn.arrow, compression = "uncompressed")
}
}
times.robj <- matrix(0, nrow=length(rownums), ncol=length(colnums))
rownames(times.robj) <- paste(rownums,"rows")
colnames(times.robj) <- paste(colnums,"cols")
times.arrow <- times.robj
for (i in 1:length(rownums)) {
for (j in 1:length(colnums)) {
rownum <- rownums[i]
colnum <- colnums[j]
fn.robj <- paste0(tmpdir, "/robj.",rownum,"x",colnum)
fn.arrow <- paste0(tmpdir, "/arrow.",rownum,"x",colnum)
# measure 2nd load to account for cold caches
load(fn.robj)
start <- Sys.time();
load(fn.robj);
times.robj[i,j] <- Sys.time()-start
tst <- arrow::read_feather(fn.arrow)
start <- Sys.time();
tst <- arrow::read_feather(fn.arrow);
times.arrow[i,j] <- Sys.time()-start
}
}
times.arrow / times.robj
#> 10 cols 20 cols 30 cols 100 cols 150 cols 200 cols
#> 1 rows 14.4725275 17.9795082 18.4140625 21.90818859 47.65606362 22.84116694
#> 2 rows 15.1983806 16.2460317 16.9053030 18.67129630 20.76380952 37.10859729
#> 3 rows 21.7117117 15.6601562 15.0646259 17.03752759 17.77000000 19.34379458
#> 4 rows 15.7056277 16.7242798 14.8692810 16.17453799 16.61224490 18.86018642
#> 5 rows 13.1034483 14.4306050 14.9470199 14.90576923 17.99046105 18.01030928
#> 10 rows 12.5816327 12.9710611 13.5114943 12.35703002 28.33454988 13.22032289
#> 20 rows 12.0430464 10.7642276 10.1307339 9.10829493 8.45411765 9.29576547
#> 30 rows 11.1220238 9.6205251 8.8284024 6.56949960 6.90670927 7.49974529
#> 40 rows 10.7088235 9.0176600 8.0673953 6.57269790 6.01518560 6.51640071
#> 50 rows 8.8784119 8.7257384 7.2162162 5.68754448 5.36519115 5.89375727
#> 60 rows 9.7962963 8.1595960 6.8823529 5.16987179 10.22431958 4.99090247
#> 70 rows 8.4882075 8.1819961 6.6296296 5.04599761 4.74102564 4.54345654
#> 100 rows 8.2778993 6.3507692 5.5512821 3.87919776 3.18816885 3.65419847
#> 200 rows 6.9781818 4.6319149 11.3175395 2.39477680 2.22712351 2.23399873
#> 300 rows 5.9528875 3.4087948 2.8162523 2.28367392 1.53755051 1.65800866
#> 400 rows 4.7578419 3.0028986 2.2602876 2.15348917 1.26760074 1.21309890
#> 500 rows 4.1558308 2.5225768 2.2711656 1.41115560 1.05550257 1.02989052
#> 1000 rows 2.2786585 1.3790087 3.0056259 0.60250798 0.53179530 0.53369967
#> 2000 rows 1.3539916 1.5805147 0.5737926 0.30327838 0.27820840 0.27057028
#> 3000 rows 1.1347815 0.5374048 0.3965298 0.20412111 0.19350023 0.45714431
#> 4000 rows 0.7417894 0.4128671 3.5819726 0.24726677 0.14699569 0.14043276
#> 5000 rows 0.6041413 0.3378337 0.8593773 0.19491538 0.12437216 0.11456206
#> 10000 rows 0.3014837 0.1828018 0.1201612 0.02665133 0.05724913 0.05461478
#> 300 cols 500 cols
#> 1 rows 27.20939086 48.20383912
#> 2 rows 25.13126492 34.15562914
#> 3 rows 24.11811024 30.89401968
#> 4 rows 21.79393939 26.18478261
#> 5 rows 20.94679803 26.48522653
#> 10 rows 14.96833216 25.12523191
#> 20 rows 10.51369216 15.84330318
#> 30 rows 7.43155288 11.73603952
#> 40 rows 6.62136223 10.43135770
#> 50 rows 5.99006711 9.25798485
#> 60 rows 5.04369274 6.14095785
#> 70 rows 4.75809650 5.70886076
#> 100 rows 5.00190311 4.54890153
#> 200 rows 4.50396996 2.68490953
#> 300 rows 2.99969424 1.89673687
#> 400 rows 2.34352282 1.48038762
#> 500 rows 2.03165384 1.20663080
#> 1000 rows 0.70601711 0.63683243
#> 2000 rows 0.27909992 0.43289769
#> 3000 rows 0.18386126 0.20415949
#> 4000 rows 0.29411463 0.16423265
#> 5000 rows 0.11312960 0.12045428
#> 10000 rows 0.05825836 0.06443037
```
<sup>Created on 2023-01-10 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
Using default compression:
``` r
tmpdir <- tempfile()
dir.create(tmpdir)
colnums <- c(10,20,30,100,150,200,300,500)
rownums <- c(1,2,3,4,5,10,20,30,40,50,60,70,100,200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000, 10000)
# Generate files
for (colnum in colnums) {
for (rownum in rownums) {
fn.robj <- paste0(tmpdir, "/robj.",rownum,"x",colnum)
fn.arrow <- paste0(tmpdir, "/arrow.",rownum,"x",colnum)
dat <- as.data.frame(matrix(runif(rownum*colnum), nrow=rownum, ncol=colnum))
save(dat, file=fn.robj)
arrow::write_feather(x = dat, sink = fn.arrow)
}
}
times.robj <- matrix(0, nrow=length(rownums), ncol=length(colnums))
rownames(times.robj) <- paste(rownums,"rows")
colnames(times.robj) <- paste(colnums,"cols")
times.arrow <- times.robj
for (i in 1:length(rownums)) {
for (j in 1:length(colnums)) {
rownum <- rownums[i]
colnum <- colnums[j]
fn.robj <- paste0(tmpdir, "/robj.",rownum,"x",colnum)
fn.arrow <- paste0(tmpdir, "/arrow.",rownum,"x",colnum)
# measure 2nd load to account for cold caches
load(fn.robj)
start <- Sys.time();
load(fn.robj);
times.robj[i,j] <- Sys.time()-start
tst <- arrow::read_feather(fn.arrow)
start <- Sys.time();
tst <- arrow::read_feather(fn.arrow);
times.arrow[i,j] <- Sys.time()-start
}
}
times.arrow / times.robj
#> 10 cols 20 cols 30 cols 100 cols 150 cols 200 cols
#> 1 rows 16.9572954 19.6031746 19.4701754 26.5231144 56.01642710 33.39605735
#> 2 rows 19.0990991 20.9177489 20.8730769 23.7868481 26.38644689 45.43119266
#> 3 rows 21.1547619 19.2469136 21.4253731 21.8924051 24.21588946 25.88827586
#> 4 rows 18.4112554 18.8007663 18.4275862 21.3195021 22.97166667 26.45885635
#> 5 rows 15.8395522 17.8750000 16.3880597 22.0901804 22.89716841 26.51943005
#> 10 rows 15.3061224 13.0547945 14.9222520 16.8244767 34.23970944 17.79330709
#> 20 rows 14.3840830 13.6781609 12.3011236 11.2735528 11.33975904 11.87954111
#> 30 rows 13.5421687 11.0495283 10.1816514 9.3316370 9.57760314 9.62248996
#> 40 rows 12.6453488 9.8964059 9.9819168 7.7601744 8.33240067 8.37088608
#> 50 rows 11.8975069 10.1530612 10.3616000 7.4708579 7.31219272 7.15629522
#> 60 rows 11.3643836 8.9316081 8.3958991 7.0183366 12.72128146 6.76300578
#> 70 rows 11.0265252 9.6686869 8.1184408 6.6577017 6.38455080 6.92413793
#> 100 rows 10.3680556 8.0369748 6.4965116 5.1863354 4.83441670 5.06206362
#> 200 rows 12.3647059 6.8830275 4.9482612 3.2896631 3.24210312 3.27877754
#> 300 rows 5.7400000 4.5351986 3.5697161 2.3988402 2.14011906 2.06634286
#> 400 rows 5.0799087 2.9543702 2.8629648 1.7690058 1.72880966 1.76503533
#> 500 rows 4.4447884 2.8496770 2.3769231 1.4735886 1.35359428 1.52543420
#> 1000 rows 2.7072555 1.5854657 1.3616873 0.7840171 0.76427293 0.72445101
#> 2000 rows 1.5208333 0.8911792 0.6701459 0.4350788 0.37124991 0.37946588
#> 3000 rows 1.0453862 0.6643997 0.5169999 0.2656266 0.24755968 0.25853659
#> 4000 rows 0.8616682 0.4784442 0.4127477 0.2119238 0.19982264 0.19844568
#> 5000 rows 0.8958047 0.3799294 0.3235682 0.1832789 0.16097686 0.16914301
#> 10000 rows 0.3733628 0.2193108 0.1665289 0.1076588 0.09350925 0.08932051
#> 300 cols 500 cols
#> 1 rows 35.87483176 62.1506196
#> 2 rows 32.28801843 44.1924342
#> 3 rows 31.48050459 39.4118098
#> 4 rows 29.49416755 36.0374823
#> 5 rows 28.25379171 34.6821192
#> 10 rows 20.96233383 29.6552511
#> 20 rows 12.59460738 21.9988169
#> 30 rows 10.30442541 15.0805057
#> 40 rows 9.17821473 13.3024585
#> 50 rows 7.90048940 10.9834538
#> 60 rows 7.22199747 8.0121655
#> 70 rows 7.09084699 7.5827408
#> 100 rows 5.27838565 5.9264278
#> 200 rows 5.55643482 3.2979336
#> 300 rows 3.63902649 2.3820292
#> 400 rows 3.04591480 1.9261239
#> 500 rows 2.45318492 1.5959291
#> 1000 rows 1.27772319 0.8132839
#> 2000 rows 0.70657236 0.4209621
#> 3000 rows 0.50213646 0.2835666
#> 4000 rows 0.20044236 0.2147253
#> 5000 rows 0.14406603 0.1745972
#> 10000 rows 0.08071889 0.1012044
```
<sup>Created on 2023-01-10 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
[GitHub] [arrow] OfekShilon commented on issue #15271: [R] Arrow file load much slower than native R for small data frames
Posted by GitBox <gi...@apache.org>.
OfekShilon commented on issue #15271:
URL: https://github.com/apache/arrow/issues/15271#issuecomment-1382862116
@paleolimbot Thanks for the suggestion. I'm aware of the overhead of metadata from [this discussion](https://github.com/apache/arrow/pull/15252#issuecomment-1375760926), but there is no metadata to speak of in the files in this example (not even row names) - and indeed I don't see any definite win by dropping it:
``` r
tmpdir <- tempfile()
dir.create(tmpdir)
colnums <- c(10,20,30,100,150,200)
rownums <- c(1,2,3,4,5,10,20,30,40,50,60,70,100,200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000, 10000)
# Generate files
for (colnum in colnums) {
for (rownum in rownums) {
fn.robj <- paste0(tmpdir, "/robj.",rownum,"x",colnum)
fn.arrow <- paste0(tmpdir, "/arrow.",rownum,"x",colnum)
fn.arrow.nometa <- paste0(tmpdir, "/arrow.nometa.",rownum,"x",colnum)
dat <- as.data.frame(matrix(runif(rownum*colnum), nrow=rownum, ncol=colnum))
save(dat, file=fn.robj)
# create the table manually to avoid metadata
dat_table <- arrow::as_arrow_table(dat)
arrow::write_feather(x = dat_table, sink = fn.arrow, compression = "uncompressed")
schema <- dat_table$schema
schema$metadata <- NULL
dat_table <- dat_table$cast(schema)
arrow::write_feather(x = dat_table, sink = fn.arrow.nometa, compression = "uncompressed")
}
}
times.robj <- matrix(0, nrow=length(rownums), ncol=length(colnums))
rownames(times.robj) <- paste(rownums,"rows")
colnames(times.robj) <- paste(colnums,"cols")
times.arrow <- times.robj
times.arrow.nometa <- times.robj
for (i in 1:length(rownums)) {
for (j in 1:length(colnums)) {
rownum <- rownums[i]
colnum <- colnums[j]
fn.robj <- paste0(tmpdir, "/robj.",rownum,"x",colnum)
fn.arrow <- paste0(tmpdir, "/arrow.",rownum,"x",colnum)
fn.arrow.nometa <- paste0(tmpdir, "/arrow.nometa.",rownum,"x",colnum)
# measure 2nd load to account for cold caches
load(fn.robj)
start <- Sys.time();
load(fn.robj);
times.robj[i,j] <- Sys.time()-start
tst <- arrow::read_feather(fn.arrow)
start <- Sys.time();
tst <- arrow::read_feather(fn.arrow, as_data_frame = TRUE, mmap = TRUE);
times.arrow[i,j] <- Sys.time()-start
tst <- arrow::read_feather(fn.arrow.nometa)
start <- Sys.time();
tst <- arrow::read_feather(fn.arrow.nometa, as_data_frame = TRUE, mmap = TRUE);
times.arrow.nometa[i,j] <- Sys.time()-start
}
}
```
Gives -
```
> times.arrow.nometa / times.robj
10 cols 20 cols 30 cols 100 cols 150 cols 200 cols
1 rows 70.0114504 58.2468085 46.1319444 21.80683403 11.17302053 17.57530120
2 rows 43.8119658 35.5418327 44.6910569 14.93035480 92.98226950 15.06676136
3 rows 59.6066351 35.9829060 17.2069672 20.61194030 16.86906710 15.61495845
4 rows 236.3318182 44.0948905 31.1320755 16.19062500 18.24731183 8.38811445
5 rows 38.0276498 30.7560976 17.5539419 15.84103512 13.33577713 11.82111801
10 rows 29.7992278 25.0996785 13.4232082 13.74528302 10.35152838 8.64968153
20 rows 25.3423729 19.8398950 17.2170022 9.75327511 7.33414833 8.22432262
30 rows 16.1743697 11.8511628 19.6003824 7.47927032 5.79861111 5.35154017
40 rows 30.6280992 24.8726236 20.2042105 19.87437811 6.08097028 4.42742382
50 rows 29.8060000 22.1587838 18.2661499 3.84229508 6.51073729 3.17507246
60 rows 20.1960298 16.0766610 12.6747851 5.38434983 3.86645595 16.07189542
70 rows 19.9536585 13.5328597 15.0110345 4.48984526 3.51769231 2.88766452
100 rows 17.3659091 11.7341577 7.4166054 4.57267189 3.18088012 2.27087242
200 rows 16.2354892 10.2235047 6.3983116 4.09790752 1.63446432 1.69634703
300 rows 7.3573854 7.1700787 4.5906849 1.63513514 5.10005897 1.15804737
400 rows 6.7309689 4.7252280 4.1573647 1.17293525 0.92025293 0.95345718
500 rows 7.0257590 4.1061644 2.8501041 1.03354651 0.70476702 0.59290404
1000 rows 5.9288681 3.4319209 1.4538153 0.51121076 0.37171409 0.34528247
2000 rows 2.2429879 1.1519025 0.7536303 0.26461660 0.18316701 0.13906527
3000 rows 1.4939711 0.8129300 0.5481969 0.17433917 0.13261505 0.11275994
4000 rows 1.3325031 0.6623176 0.5343539 0.13499381 0.09493322 0.07435454
5000 rows 0.8804653 3.8379121 0.3498308 0.13787353 0.07957967 0.06604075
10000 rows 0.7404644 0.2288824 0.1536465 0.06676482 0.04172911 0.03254246
```
My measurements differ from yours, but even yours show robj winning by a wide margin for <2000 rows.