Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/07 12:56:21 UTC

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8367: ARROW-10099: [C++][Dataset] Simplify type inference for partition columns

jorisvandenbossche commented on a change in pull request #8367:
URL: https://github.com/apache/arrow/pull/8367#discussion_r500973567



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1321,38 +1321,36 @@ cdef class DirectoryPartitioning(Partitioning):
         self.directory_partitioning = <CDirectoryPartitioning*> sp.get()
 
     @staticmethod
-    def discover(field_names, object max_partition_dictionary_size=0):
+    def discover(field_names, inspect_dictionary=False):
         """
         Discover a DirectoryPartitioning.
 
         Parameters
         ----------
         field_names : list of str
             The names to associate with the values from the subdirectory names.
-        max_partition_dictionary_size : int or None, default 0
-            The maximum number of unique values to consider for dictionary
-            encoding. By default no field will be inferred as dictionary
-            encoded. If None is provided dictionary encoding will be used for
-            every string field.
+        inspect_dictionary : bool, default False
+            When inferring a schema for partition fields, yield dictionary
+            encoded types instead of plain. This can be more efficient when

Review comment:
    ```suggestion
            encoded types instead of plain types. This can be more efficient when
    ```

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1321,38 +1321,36 @@ cdef class DirectoryPartitioning(Partitioning):
         self.directory_partitioning = <CDirectoryPartitioning*> sp.get()
 
     @staticmethod
-    def discover(field_names, object max_partition_dictionary_size=0):
+    def discover(field_names, inspect_dictionary=False):

Review comment:
      It might be useful to still accept `max_partition_dictionary_size=None/-1` for now, and if that is passed, set `inspect_dictionary` to True.
    In the short term, that would help dask write a parquet reader that works on both pyarrow 1.0 and 2.0.
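    The compatibility shim suggested here could be sketched as follows. This is a hypothetical illustration, not the actual PR code; the `_UNSET` sentinel, the warning text, and the returned dict are assumptions for demonstration:

    ```python
    import warnings

    _UNSET = object()  # sentinel: distinguishes "not passed" from an explicit value


    def discover(field_names, inspect_dictionary=False,
                 max_partition_dictionary_size=_UNSET):
        """Hypothetical sketch of discover() with a backward-compat shim.

        Maps the legacy max_partition_dictionary_size=None/-1 ("dictionary
        encode everything") onto the new inspect_dictionary flag.
        """
        if max_partition_dictionary_size is not _UNSET:
            warnings.warn(
                "max_partition_dictionary_size is deprecated; "
                "use inspect_dictionary instead", FutureWarning)
            if max_partition_dictionary_size in (None, -1):
                inspect_dictionary = True
            elif max_partition_dictionary_size != 0:
                raise NotImplementedError(
                    "arbitrary max_partition_dictionary_size is no longer "
                    "supported")
        # Stand-in for the real factory call, just to show the mapping
        return {"field_names": field_names,
                "inspect_dictionary": inspect_dictionary}
    ```

    With this shim, older callers passing `max_partition_dictionary_size=-1` get the new dictionary-inference behavior plus a deprecation warning, while the intermediate sizes that the simplified C++ API can no longer express fail loudly instead of silently changing meaning.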

##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -85,14 +85,11 @@ class ARROW_DS_EXPORT Partitioning {
 };
 
 struct PartitioningFactoryOptions {
-  /// When inferring a schema for partition fields, string fields may be inferred as
-  /// a dictionary type instead. This can be more efficient when materializing virtual
-  /// columns. If the number of discovered unique values of a string field exceeds
-  /// max_partition_dictionary_size, it will instead be inferred as a string.
-  ///
-  /// max_partition_dictionary_size = 0: No fields will be inferred as dictionary.
-  /// max_partition_dictionary_size = -1: All fields will be inferred as dictionary.
-  int max_partition_dictionary_size = 0;
+  /// When inferring a schema for partition fields, yield dictionary encoded types
+  /// instead of plain. This can be more efficient when materializing virtual
+  /// columns, and Expressions parsed by the finished Partitioning will include
+  /// dictionaries of all unique inspected values for each field.
+  bool inspect_dictionary = false;

Review comment:
      Or maybe `use_dictionary` or `infer_dictionary`?
    Is the "inspect" pointing to the fact that we inspect all paths to get the dictionary values?
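    For context on why the option exists at all: a partition column repeats a few unique path segments many times, so dictionary encoding stores each unique value once plus small integer indices. A toy pure-Python illustration (example data assumed; Arrow's actual encoding is implemented in C++ and differs in detail):

    ```python
    def dictionary_encode(values):
        """Toy dictionary encoding: return (dictionary, indices).

        Not Arrow's implementation; this only shows why repeated
        partition values are cheap to materialize as dictionary columns.
        """
        dictionary = []   # unique values, in first-seen order
        index_of = {}     # value -> position in `dictionary`
        indices = []      # one small int per row instead of one string
        for v in values:
            if v not in index_of:
                index_of[v] = len(dictionary)
                dictionary.append(v)
            indices.append(index_of[v])
        return dictionary, indices


    # A partition column typically repeats a handful of values per file
    column = ["2020-01", "2020-01", "2020-02", "2020-01", "2020-02"]
    dictionary, indices = dictionary_encode(column)
    ```

    The full strings are stored once in `dictionary`; each row only carries an index, which is what makes materializing virtual partition columns more efficient.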
   
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org