Posted to jira@arrow.apache.org by "Alessandro Molina (Jira)" <ji...@apache.org> on 2022/10/24 15:38:00 UTC

[jira] [Updated] (ARROW-18114) [R] unify_schemas=FALSE does not improve open_dataset() read times

     [ https://issues.apache.org/jira/browse/ARROW-18114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alessandro Molina updated ARROW-18114:
--------------------------------------
    Description: 
open_dataset() provides the very helpful optional argument unify_schemas=FALSE, which should allow arrow to inspect a single parquet file instead of touching potentially thousands of parquet files to determine a consistent unified schema.  This ought to provide a substantial performance increase in contexts where the schema is known in advance.

Unfortunately, in my tests it seems to have no impact on performance.  Consider the following reprexes:

default, unify_schemas=TRUE:
{code:r}
library(arrow)
ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min",
                endpoint_override = "data.ecoforecast.org", anonymous = TRUE)

bench::bench_time({
  open_dataset(ex)
})
{code}
This takes about 32 seconds for me.

manual, unify_schemas=FALSE:
{code:r}
bench::bench_time({
  open_dataset(ex, unify_schemas = FALSE)
})
{code}
This also takes about 32 seconds.
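For comparison, the case I would expect to be fast is supplying the schema explicitly, so that no per-file inspection should be needed at all. Below is a minimal sketch of that comparison; it assumes the schema argument of open_dataset() is honored as documented, and it captures the schema once (slowly) outside the timed block only to keep the example self-contained:
{code:r}
library(arrow)

ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min",
                endpoint_override = "data.ecoforecast.org", anonymous = TRUE)

# One-time (slow) call outside the benchmark, just to capture the unified schema;
# in practice this would come from prior knowledge of the dataset
sch <- open_dataset(ex)$schema

# With an explicit schema, I would expect Arrow not to re-inspect every file
bench::bench_time({
  open_dataset(ex, schema = sch)
})
{code}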

  was:
open_dataset() provides the very helpful optional argument to set unify_schemas=FALSE, which should allow arrow to inspect a single parquet file instead of touching potentially thousands or more parquet files to determine a consistent unified schema.  This ought to provide a substantial performance increase in contexts where the schema is known in advance. 

Unfortunately, in my tests it seems to have no impact on performance.  Consider the following reprexes:

default, unify_schemas=TRUE
library(arrow)
 ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min", endpoint_override = "data.ecoforecast.org", anonymous=TRUE)

bench::bench_time({
open_dataset(ex) 
})
about 32 seconds for me.

manual, unify_schemas=FALSE:

 
bench::bench_time({
  open_dataset(ex, unify_schemas = FALSE)
})
takes about 32 seconds as well. 


> [R] unify_schemas=FALSE does not improve open_dataset() read times
> ------------------------------------------------------------------
>
>                 Key: ARROW-18114
>                 URL: https://issues.apache.org/jira/browse/ARROW-18114
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Carl Boettiger
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)