Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/08/13 09:34:00 UTC

[jira] [Created] (ARROW-9720) [Python] Long-term fate of pyarrow.parquet.ParquetDataset

Joris Van den Bossche created ARROW-9720:
--------------------------------------------

             Summary: [Python] Long-term fate of pyarrow.parquet.ParquetDataset
                 Key: ARROW-9720
                 URL: https://issues.apache.org/jira/browse/ARROW-9720
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Joris Van den Bossche
             Fix For: 2.0.0


The business logic of the Python implementation for reading partitioned Parquet datasets in {{pyarrow.parquet.ParquetDataset}} has been ported to C++ (ARROW-3764), and can optionally be enabled in {{ParquetDataset(..)}} by passing {{use_legacy_dataset=False}} (ARROW-8039).

But the question still is: what do we do with this class long term? 

So for users who now do:

{code}
dataset = pq.ParquetDataset(...)
dataset.metadata
table = dataset.read()
{code}

what should they do in the future?  
Do we keep a class like this (but backed by the {{pyarrow.dataset}} implementation), or do we deprecate the class entirely, pointing users to {{dataset = ds.dataset(..., format="parquet")}}?

In any case, we should strive to entirely delete the current custom Python implementation, but we could keep a {{ParquetDataset}} class that wraps or inherits {{pyarrow.dataset.FileSystemDataset}} and adds some Parquet specifics to it (e.g. access to the Parquet schema, the common metadata, exposing the Parquet-specific constructor keywords more easily, ..).

Features the {{ParquetDataset}} currently has that are not exactly covered by pyarrow.dataset:

- Partitioning information (the {{.partitions}} attribute)
- Access to common metadata ({{.metadata_path}}, {{.common_metadata_path}} and {{.metadata}} attributes)
- ParquetSchema of the dataset




--
This message was sent by Atlassian Jira
(v8.3.4#803005)