You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/06 18:55:05 UTC

[GitHub] [arrow] bkietz opened a new pull request #8367: ARROW-10099: [C++][Dataset] Simplify type inference for partition columns

bkietz opened a new pull request #8367:
URL: https://github.com/apache/arrow/pull/8367


   Removed cardinality considerations for inferring partition field types. There is now a boolean flag (`inspect_dictionary`, default false) which if set causes fields to be inferred as a dictionary encoded type instead of the corresponding value type.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #8367: ARROW-10099: [C++][Dataset] Simplify type inference for partition columns

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #8367:
URL: https://github.com/apache/arrow/pull/8367#discussion_r501056901



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1321,38 +1321,36 @@ cdef class DirectoryPartitioning(Partitioning):
         self.directory_partitioning = <CDirectoryPartitioning*> sp.get()
 
     @staticmethod
-    def discover(field_names, object max_partition_dictionary_size=0):
+    def discover(field_names, inspect_dictionary=False):

Review comment:
       Alright, I'll add that keyword back and document that it has been replaced




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8367: ARROW-10099: [C++][Dataset] Simplify type inference for partition columns

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #8367:
URL: https://github.com/apache/arrow/pull/8367#discussion_r501100412



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1403,34 +1412,42 @@ cdef class HivePartitioning(Partitioning):
         self.hive_partitioning = <CHivePartitioning*> sp.get()
 
     @staticmethod
-    def discover(object max_partition_dictionary_size=0):
+    def discover(infer_dictionary=False, max_partition_dictionary_size=0):
         """
         Discover a HivePartitioning.
 
-        Params
-        ------
-        max_partition_dictionary_size : int or None, default 0
-            The maximum number of unique values to consider for dictionary
-            encoding. By default no field will be inferred as dictionary
-            encoded. If -1 is provided dictionary encoding will be used for
-            every string field.
+        Parameters
+        ----------
+        infer_dictionary : bool, default False
+            When inferring a schema for partition fields, yield dictionary
+            encoded types instead of plain. This can be more efficient when
+            materializing virtual columns, and Expressions parsed by the
+            finished Partitioning will include dictionaries of all unique
+            inspected values for each field.
+        max_partition_dictionary_size : int, default 0
+            Synonymous with infer_dictionary for backwards compatibility with
+            1.0: setting this to -1 or None is equivalent to passing
+            infer_dictionary=True.
 
         Returns
         -------
         PartitioningFactory
             To be used in the FileSystemFactoryOptions.
         """
         cdef:
-            CPartitioningFactoryOptions options
+            CPartitioningFactoryOptions c_options
 
-        if max_partition_dictionary_size is None:
-            max_partition_dictionary_size = -1
+        if max_partition_dictionary_size in {-1, None}:
+            infer_dictionary = True
+        else if max_partition_dictionary_size != 0:

Review comment:
       ```suggestion
           elif max_partition_dictionary_size != 0:
   ```

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1321,38 +1321,47 @@ cdef class DirectoryPartitioning(Partitioning):
         self.directory_partitioning = <CDirectoryPartitioning*> sp.get()
 
     @staticmethod
-    def discover(field_names, object max_partition_dictionary_size=0):
+    def discover(field_names, infer_dictionary=False,
+                 max_partition_dictionary_size=0):
         """
         Discover a DirectoryPartitioning.
 
         Parameters
         ----------
         field_names : list of str
             The names to associate with the values from the subdirectory names.
-        max_partition_dictionary_size : int or None, default 0
-            The maximum number of unique values to consider for dictionary
-            encoding. By default no field will be inferred as dictionary
-            encoded. If None is provided dictionary encoding will be used for
-            every string field.
+        infer_dictionary : bool, default False
+            When inferring a schema for partition fields, yield dictionary
+            encoded types instead of plain types. This can be more efficient
+            when materializing virtual columns, and Expressions parsed by the
+            finished Partitioning will include dictionaries of all unique
+            inspected values for each field.
+        max_partition_dictionary_size : int, default 0
+            Synonymous with infer_dictionary for backwards compatibility with
+            1.0: setting this to -1 or None is equivalent to passing
+            infer_dictionary=True.
 
         Returns
         -------
         DirectoryPartitioningFactory
             To be used in the FileSystemFactoryOptions.
         """
         cdef:
-            CPartitioningFactoryOptions options
+            CPartitioningFactoryOptions c_options
             vector[c_string] c_field_names
 
-        if max_partition_dictionary_size is None:
-            max_partition_dictionary_size = -1
+        if max_partition_dictionary_size in {-1, None}:
+            infer_dictionary = True
+        else if max_partition_dictionary_size != 0:

Review comment:
       ```suggestion
           elif max_partition_dictionary_size != 0:
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #8367: ARROW-10099: [C++][Dataset] Simplify type inference for partition columns

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #8367:
URL: https://github.com/apache/arrow/pull/8367#issuecomment-704490021


   https://issues.apache.org/jira/browse/ARROW-10099


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #8367: ARROW-10099: [C++][Dataset] Simplify type inference for partition columns

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #8367:
URL: https://github.com/apache/arrow/pull/8367#discussion_r501056321



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -85,14 +85,11 @@ class ARROW_DS_EXPORT Partitioning {
 };
 
 struct PartitioningFactoryOptions {
-  /// When inferring a schema for partition fields, string fields may be inferred as
-  /// a dictionary type instead. This can be more efficient when materializing virtual
-  /// columns. If the number of discovered unique values of a string field exceeds
-  /// max_partition_dictionary_size, it will instead be inferred as a string.
-  ///
-  /// max_partition_dictionary_size = 0: No fields will be inferred as dictionary.
-  /// max_partition_dictionary_size = -1: All fields will be inferred as dictionary.
-  int max_partition_dictionary_size = 0;
+  /// When inferring a schema for partition fields, yield dictionary encoded types
+  /// instead of plain. This can be more efficient when materializing virtual
+  /// columns, and Expressions parsed by the finished Partitioning will include
+  /// dictionaries of all unique inspected values for each field.
+  bool inspect_dictionary = false;

Review comment:
       I'll rename to `infer_dictionary`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz closed pull request #8367: ARROW-10099: [C++][Dataset] Simplify type inference for partition columns

Posted by GitBox <gi...@apache.org>.

bkietz closed pull request #8367:
URL: https://github.com/apache/arrow/pull/8367


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8367: ARROW-10099: [C++][Dataset] Simplify type inference for partition columns

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #8367:
URL: https://github.com/apache/arrow/pull/8367#discussion_r500973567



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1321,38 +1321,36 @@ cdef class DirectoryPartitioning(Partitioning):
         self.directory_partitioning = <CDirectoryPartitioning*> sp.get()
 
     @staticmethod
-    def discover(field_names, object max_partition_dictionary_size=0):
+    def discover(field_names, inspect_dictionary=False):
         """
         Discover a DirectoryPartitioning.
 
         Parameters
         ----------
         field_names : list of str
             The names to associate with the values from the subdirectory names.
-        max_partition_dictionary_size : int or None, default 0
-            The maximum number of unique values to consider for dictionary
-            encoding. By default no field will be inferred as dictionary
-            encoded. If None is provided dictionary encoding will be used for
-            every string field.
+        inspect_dictionary : bool, default False
+            When inferring a schema for partition fields, yield dictionary
+            encoded types instead of plain. This can be more efficient when

Review comment:
       ```suggestion
               encoded types instead of plain types. This can be more efficient when
   ```

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1321,38 +1321,36 @@ cdef class DirectoryPartitioning(Partitioning):
         self.directory_partitioning = <CDirectoryPartitioning*> sp.get()
 
     @staticmethod
-    def discover(field_names, object max_partition_dictionary_size=0):
+    def discover(field_names, inspect_dictionary=False):

Review comment:
       It might be useful to still accept `max_partition_dictionary_size=None/-1` for now, and if that is passed, set `inspect_dictionary` to True. 
   That would help a bit for dask to write a parquet reader that works both on pyarrow 1.0 and 2.0, on the short term.

##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -85,14 +85,11 @@ class ARROW_DS_EXPORT Partitioning {
 };
 
 struct PartitioningFactoryOptions {
-  /// When inferring a schema for partition fields, string fields may be inferred as
-  /// a dictionary type instead. This can be more efficient when materializing virtual
-  /// columns. If the number of discovered unique values of a string field exceeds
-  /// max_partition_dictionary_size, it will instead be inferred as a string.
-  ///
-  /// max_partition_dictionary_size = 0: No fields will be inferred as dictionary.
-  /// max_partition_dictionary_size = -1: All fields will be inferred as dictionary.
-  int max_partition_dictionary_size = 0;
+  /// When inferring a schema for partition fields, yield dictionary encoded types
+  /// instead of plain. This can be more efficient when materializing virtual
+  /// columns, and Expressions parsed by the finished Partitioning will include
+  /// dictionaries of all unique inspected values for each field.
+  bool inspect_dictionary = false;

Review comment:
       Or maybe `use_dictionary` or `infer_dictionary` ?
   The "inspect" is pointing to the fact that we inspect all paths to get the dictionary values? 
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org