You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/16 09:35:02 UTC

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7437: ARROW-8943: [C++][Python][Dataset] Add partitioning support to ParquetDatasetFactory

jorisvandenbossche commented on a change in pull request #7437:
URL: https://github.com/apache/arrow/pull/7437#discussion_r440673889



##########
File path: python/pyarrow/dataset.py
##########
@@ -445,7 +446,8 @@ def _union_dataset(children, schema=None, **kwargs):
     return UnionDataset(schema, children)
 
 
-def parquet_dataset(metadata_path, schema=None, filesystem=None, format=None):
+def parquet_dataset(metadata_path, schema=None, filesystem=None, format=None,
+                    partitioning=None, partition_base_dir=None):

Review comment:
       Can you also update the docstring here?

##########
File path: cpp/src/arrow/dataset/file_parquet.h
##########
@@ -215,6 +215,34 @@ class ARROW_DS_EXPORT ParquetFileFragment : public FileFragment {
   friend class ParquetFileFormat;
 };
 
+struct ParquetFactoryOptions {
+  // Either an explicit Partitioning or a PartitioningFactory to discover one.
+  //
+  // If a factory is provided, it will be used to infer a schema for partition fields
+  // based on file and directory paths then construct a Partitioning. The default
+  // is a Partitioning which will yield no partition information.
+  //
+  // The (explicit or discovered) partitioning will be applied to discovered files
+  // and the resulting partition information embedded in the Dataset.
+  PartitioningOrFactory partitioning{Partitioning::Default()};
+
+  // For the purposes of applying the partitioning, paths will be stripped
+  // of the partition_base_dir. Files not matching the partition_base_dir
+  // prefix will be skipped for partition discovery. The ignored files will still
+  // be part of the Dataset, but will not have partition information.
+  //
+  // Example:
+  // partition_base_dir = "/dataset";
+  //
+  // - "/dataset/US/sales.csv" -> "US/sales.csv" will be given to the partitioning
+  //
+  // - "/home/john/late_sales.csv" -> Will be ignored for partition discovery.
+  //
+  // This is useful for partitioning which parses directory when ordering
+  // is important, e.g. DirectoryPartitioning.
+  std::string partition_base_dir;

Review comment:
       Does this have a default? (eg I would assume the base directory of `_metadata` file)

##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -618,15 +641,37 @@ ParquetDatasetFactory::CollectParquetFragments(
   }
 }
 
+Result<std::vector<std::shared_ptr<Schema>>> ParquetDatasetFactory::InspectSchemas(
+    InspectOptions options) {
+  std::shared_ptr<Schema> schema;
+  RETURN_NOT_OK(parquet::arrow::FromParquetSchema(metadata_->schema(), &schema));
+
+  // Gather paths found in RowGroups' ColumnChunks.
+  auto properties = MakeArrowReaderProperties(*format_, *metadata_);
+  ARROW_ASSIGN_OR_RAISE(auto paths, CollectPaths(*metadata_, properties));

Review comment:
       Should we only collect those paths here again if there is an actual partitioning? (when `options_.partitioning` is not the default "no partitioning") 
   (I don't know how costly this is, probably not much)

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1493,6 +1493,80 @@ cdef class UnionDatasetFactory(DatasetFactory):
         self.union_factory = <CUnionDatasetFactory*> sp.get()
 
 
+cdef class ParquetFactoryOptions:
+    """
+    Influences the discovery of parquet dataset.
+
+    Parameters
+    ----------
+    partition_base_dir : str, optional
+        For the purposes of applying the partitioning, paths will be
+        stripped of the partition_base_dir. Files not matching the
+        partition_base_dir prefix will be skipped for partitioning discovery.
+        The ignored files will still be part of the Dataset, but will not
+        have partition information.
+    """

Review comment:
       Can you include an entry for the `partitioning` parameter as well? 
   (for some reason it seems this is also not included in the FileSystemFactoryOptions, but I think it is better to include)

##########
File path: python/pyarrow/dataset.py
##########
@@ -486,8 +488,13 @@ def parquet_dataset(metadata_path, schema=None, filesystem=None, format=None):
         filesystem, _ = _ensure_filesystem(filesystem)
 
     metadata_path = _normalize_path(filesystem, _stringify_path(metadata_path))
+    options = ParquetFactoryOptions(
+        partition_base_dir=partition_base_dir,
+        partitioning=partitioning

Review comment:
       Can you add here a `partitioning = _ensure_partitioning(partitioning)` before passing it to ParquetFactoryOptions? 
   
   (that will eg make sure that `parquet_dataset(..., partitioning="hive")` will also work, similarly as for `dataset(..)`)

##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1522,19 +1522,20 @@ def _create_parquet_dataset_partitioned(root_path):
 @pytest.mark.parquet
 @pytest.mark.pandas
 def test_parquet_dataset_factory_partitioned(tempdir):
-    # TODO support for specifying partitioning scheme
-
     root_path = tempdir / "test_parquet_dataset_factory_partitioned"
     metadata_path, table = _create_parquet_dataset_partitioned(root_path)
 
-    dataset = ds.parquet_dataset(metadata_path)
-    # TODO partition column not yet included
-    # assert dataset.schema.equals(table.schema)
+    partitioning = ds.partitioning(flavor="hive")
+    dataset = ds.parquet_dataset(metadata_path,
+                                 partitioning=partitioning,
+                                 partition_base_dir=str(root_path))

Review comment:
       Related to my previous question about the default of this keyword: is it in this case needed to pass this?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org