Posted to jira@arrow.apache.org by "Carl Boettiger (Jira)" <ji...@apache.org> on 2022/12/15 21:44:00 UTC
[jira] [Commented] (ARROW-18114) [R] unify_schemas=FALSE does not improve open_dataset() read times
[ https://issues.apache.org/jira/browse/ARROW-18114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648256#comment-17648256 ]
Carl Boettiger commented on ARROW-18114:
----------------------------------------
Just an additional comment that this behavior also seems to occur whether or not the schema is specified manually, as well as when unify_schemas=FALSE (i.e. the schema is determined from the first parquet file). Here's another, more extreme example, owing to an even larger number of partitions:
{code:java}
forecast_schema <- function() {
  arrow::schema(
    target_id = arrow::string(),
    datetime = arrow::timestamp("us", timezone = "UTC"),
    parameter = arrow::string(),
    variable = arrow::string(),
    prediction = arrow::float64(),
    family = arrow::string(),
    reference_datetime = arrow::string(),
    site_id = arrow::string(),
    model_id = arrow::string(),
    date = arrow::string()
  )
}
s3 <- arrow::s3_bucket("neon4cast-forecasts/parquet/phenology",
                       endpoint_override = "data.ecoforecast.org",
                       anonymous = TRUE)
ds <- arrow::open_dataset(s3, schema = forecast_schema())
{code}
> [R] unify_schemas=FALSE does not improve open_dataset() read times
> ------------------------------------------------------------------
>
> Key: ARROW-18114
> URL: https://issues.apache.org/jira/browse/ARROW-18114
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Carl Boettiger
> Priority: Major
>
> open_dataset() provides the very helpful optional argument to set unify_schemas=FALSE, which should allow arrow to inspect a single parquet file instead of touching potentially thousands or more parquet files to determine a consistent unified schema. This ought to provide a substantial performance increase in contexts where the schema is known in advance.
> Unfortunately, in my tests it seems to have no impact on performance. Consider the following reprexes:
> default, unify_schemas=TRUE
> {code:java}
> library(arrow)
> ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min", endpoint_override = "data.ecoforecast.org", anonymous=TRUE)
> bench::bench_time(
> { open_dataset(ex) }
> ){code}
> about 32 seconds for me.
> manual, unify_schemas=FALSE:
> {code:java}
> bench::bench_time({
> open_dataset(ex, unify_schemas = FALSE)
> }){code}
> takes about 32 seconds as well.