Posted to user@arrow.apache.org by Charlton Callender <ch...@uw.edu> on 2020/12/23 05:05:56 UTC

[R][Dataset] how to speed up creating FileSystemDatasetFactory from a large partitioned dataset?

Hi

I am starting to use arrow in a workflow where I have a dataset partitioned by a couple of variables (like location and year), which leads to more than 100,000 parquet files.

I have been using `arrow::open_dataset(sources = FILEPATH, unify_schemas = FALSE)` but found that this takes a couple of minutes to run. I can see that almost all of the time is spent on this line creating the FileSystemDatasetFactory: https://github.com/apache/arrow/blob/master/r/R/dataset-factory.R#L135
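For reference, this is roughly how I am calling it and timing it (a minimal sketch; FILEPATH stands in for the dataset root directory):

    library(arrow)

    FILEPATH <- "/path/to/dataset"  # placeholder for the dataset root

    system.time(
      ds <- open_dataset(sources = FILEPATH, unify_schemas = FALSE)
    )
    # nearly all of the elapsed time is spent creating the
    # FileSystemDatasetFactory on the line linked above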

In my use case I know all of the partition file paths and I know the schema (and that it is consistent across partitions). Is there any way to use that information to create the Dataset object more quickly for a highly partitioned dataset?
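To sketch the idea (the field names and types below are made up for illustration, and I have not confirmed whether supplying them actually avoids the expensive discovery step), something along these lines:

    library(arrow)

    FILEPATH <- "/path/to/dataset"  # placeholder for the dataset root

    # the schema is already known and consistent across partitions;
    # these fields are hypothetical examples
    known_schema <- schema(
      location = string(),
      year     = int32(),
      value    = float64()
    )

    ds <- open_dataset(
      sources       = FILEPATH,
      schema        = known_schema,
      # assuming hive-style directories like location=USA/year=2019;
      # adjust to the actual layout
      partitioning  = hive_partition(location = string(), year = int32()),
      unify_schemas = FALSE
    )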

I found this section in the Python docs about creating a dataset from file paths; is this possible to do from R? https://arrow.apache.org/docs/python/dataset.html#manual-specification-of-the-dataset

Thank you! I’ve been finding arrow/parquet really useful as an alternative to HDF5 and CSV.

Re: [R][Dataset] how to speed up creating FileSystemDatasetFactory from a large partitioned dataset?

Posted by Charlton Callender <ch...@uw.edu>.
Thank you for the response, Neal. It is helpful to know there is an open issue for supporting this from R; I will watch it for updates.


Re: [R][Dataset] how to speed up creating FileSystemDatasetFactory from a large partitioned dataset?

Posted by Neal Richardson <ne...@gmail.com>.
Thanks for the report. The R bindings to the C++ methods that pyarrow is using in the docs you linked haven't been written yet. https://issues.apache.org/jira/browse/ARROW-9657 is the open issue for that. I agree that it would be good to support this from R.

A couple of minutes also seems a bit slow even for the case where you don't provide the file paths, so that would be worth investigating as well.
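One quick check (just a sketch, with a placeholder path) would be to compare the dataset discovery time against a plain recursive file listing, to see how much of it is raw filesystem traversal versus per-file work Arrow is doing:

    library(arrow)

    root <- "/path/to/dataset"  # placeholder for the dataset root

    # time a bare recursive directory listing
    system.time(
      files <- list.files(root, recursive = TRUE, full.names = TRUE)
    )
    length(files)  # should be on the order of the 100,000+ parquet files

    # compare with the full dataset discovery
    system.time(
      ds <- open_dataset(root, unify_schemas = FALSE)
    )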

Neal
