Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/22 15:58:15 UTC

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7515: ARROW-2801: [Python] Add split_row_group keyword to ParquetDataset / document split_by_row_group

jorisvandenbossche commented on a change in pull request #7515:
URL: https://github.com/apache/arrow/pull/7515#discussion_r443663674



##########
File path: python/pyarrow/parquet.py
##########
@@ -1404,27 +1403,36 @@ def __init__(self, path_or_paths, filesystem=None, filters=None,
         self._filter_expression = filters and _filters_to_expression(filters)
 
         # check for single NativeFile dataset
-        if not isinstance(path_or_paths, list):
-            if not _is_path_like(path_or_paths):
-                fragment = parquet_format.make_fragment(path_or_paths)
-                self._dataset = ds.FileSystemDataset(
-                    [fragment], schema=fragment.physical_schema,
-                    format=parquet_format
-                )
-                return
-
-        # map old filesystems to new one
-        # TODO(dataset) deal with other file systems
-        if isinstance(filesystem, LocalFileSystem):
-            filesystem = pyarrow.fs.LocalFileSystem(use_mmap=memory_map)
-        elif filesystem is None and memory_map:
-            # if memory_map is specified, assume local file system (string
-            # path can in principle be URI for any filesystem)
-            filesystem = pyarrow.fs.LocalFileSystem(use_mmap=True)
-
-        self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
-                                   format=parquet_format,
-                                   partitioning=partitioning)
+        if (not isinstance(path_or_paths, list) and
+                not _is_path_like(path_or_paths)):
+            fragment = parquet_format.make_fragment(path_or_paths)
+            dataset = ds.FileSystemDataset(
+                [fragment], schema=fragment.physical_schema,
+                format=parquet_format
+            )
+        else:
+            # map old filesystems to new one
+            # TODO(dataset) deal with other file systems
+            if isinstance(filesystem, LocalFileSystem):
+                filesystem = pyarrow.fs.LocalFileSystem(use_mmap=memory_map)
+            elif filesystem is None and memory_map:
+                # if memory_map is specified, assume local file system (string
+                # path can in principle be URI for any filesystem)
+                filesystem = pyarrow.fs.LocalFileSystem(use_mmap=True)
+
+            dataset = ds.dataset(path_or_paths, filesystem=filesystem,
+                                 format=parquet_format,
+                                 partitioning=partitioning)
+
+        if split_row_groups:
+            fragments = dataset.get_fragments()
+            fragments = [rg for fragment in fragments
+                         for rg in fragment.split_by_row_group()]
+            dataset = ds.FileSystemDataset(
+                fragments, dataset.schema, dataset.format,
+                dataset.partition_expression
+            )

Review comment:
       This is basically what was requested in ARROW-2801, but I am not fully sure it is actually worth adding here (we would be adding it to ParquetDataset, which it is not yet clear we will keep in the future). And if we do want it, it is perhaps better added to the actual Dataset class (or DatasetFactory) instead.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org