Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/07 19:55:26 UTC

[GitHub] [arrow] bkietz opened a new pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

bkietz opened a new pull request #9130:
URL: https://github.com/apache/arrow/pull/9130


   Enables usage of dictionary columns as partition columns on write.
   
   Additionally resolves some partition-related follow-ups from #8894 (@pitrou):
   - raise an error status [instead of aborting](https://github.com/apache/arrow/pull/8894/#discussion_r545219042) for overflowing maximum group count
   - handle dictionary index types [other than int32](https://github.com/apache/arrow/pull/8894/#discussion_r545215901) 
   - don't build an unused null bitmap [in CountsToOffsets](https://github.com/apache/arrow/pull/8894/#discussion_r545212237)
   - improve docstrings for [MakeGroupings, ApplyGroupings](https://github.com/apache/arrow/pull/8894/#discussion_r545209568)
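   
   For illustration, a minimal sketch of the intended usage from Python (a hedged example: it assumes the pyarrow dataset API with the `Dict[str, Array]` form of `dictionaries` adopted later in this review; column names are illustrative):
   
       import pyarrow as pa
       import pyarrow.dataset as ds
       
       # A table whose partition column is dictionary-encoded
       table = pa.table({
           "value": [1, 2, 3, 4],
           "part": pa.array(["a", "a", "b", "b"]).dictionary_encode(),
       })
       
       # Constructing the Partitioning directly requires the dictionary
       # of all values the partition column may take
       partitioning = ds.partitioning(
           pa.schema([table.schema.field("part")]),
           dictionaries={"part": pa.array(["a", "b"])},
       )
       
       ds.write_dataset(table, "partitioned-target", format="feather",
                        partitioning=partitioning)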
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555270347



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -562,6 +569,8 @@ inline Result<std::shared_ptr<Array>> CountsToOffsets(
 // since no Writers accept a selection vector.
 class StructDictionary {
  public:
+  static constexpr int32_t kMaxGroups = std::numeric_limits<int16_t>::max();

Review comment:
       You never know if someone has a strange use case requiring a lot of groups, so if there is not a technical reason, I think it's good to just allow it







[GitHub] [arrow] bkietz commented on pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#issuecomment-758114350


   Non-Apache CI: https://github.com/bkietz/arrow/runs/1682804367





[GitHub] [arrow] bkietz commented on pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#issuecomment-759613447


   @jorisvandenbossche @pitrou addressed comments, PTAL





[GitHub] [arrow] pitrou commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r557409506



##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -313,12 +313,16 @@ Status FileSystemDataset::Write(const FileSystemDatasetWriteOptions& write_optio
         ARROW_ASSIGN_OR_RAISE(auto groups, write_options.partitioning->Partition(batch));
         batch.reset();  // drop to hopefully conserve memory
 
+        if (static_cast<int>(groups.batches.size()) > write_options.max_partitions) {

Review comment:
       Rather, cast `max_partitions` to `int64_t`.







[GitHub] [arrow] jorisvandenbossche commented on pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#issuecomment-760029415


   @bkietz Thanks for the updates! I added one more test with the example from the JIRA, so as to also have a test that ensures writing a table works without specifying the dictionaries.
   
   Is it possible to specify the `max_partitions` option from python?





[GitHub] [arrow] github-actions[bot] commented on pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#issuecomment-756346881


   https://issues.apache.org/jira/browse/ARROW-10247





[GitHub] [arrow] bkietz commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555948969



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -2315,6 +2315,29 @@ def test_write_dataset_partitioned(tempdir):
         partitioning=partitioning_schema)
 
 
+@pytest.mark.parquet
+@pytest.mark.pandas
+def test_write_dataset_partitioned_dict(tempdir):
+    directory = tempdir / "partitioned"
+    _ = _create_parquet_dataset_partitioned(directory)
+
+    # directory partitioning, dictionary partition columns
+    dataset = ds.dataset(
+        directory,
+        partitioning=ds.HivePartitioning.discover(infer_dictionary=True))
+    target = tempdir / 'partitioned-dir-target'
+    expected_paths = [
+        target / "a", target / "a" / "part-0.feather",
+        target / "b", target / "b" / "part-1.feather"
+    ]
+    partitioning_schema = ds.partitioning(pa.schema([
+        dataset.schema.field('part')]),
+        dictionaries=[pa.array(['a', 'b'])])

Review comment:
       For now I've updated the docstring and put a comment in the test to indicate that dictionaries are required for parsing dict fields.







[GitHub] [arrow] bkietz commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555770771



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1433,12 +1433,19 @@ cdef class DirectoryPartitioning(Partitioning):
     cdef:
         CDirectoryPartitioning* directory_partitioning
 
-    def __init__(self, Schema schema not None):
-        cdef shared_ptr[CDirectoryPartitioning] partitioning
-        partitioning = make_shared[CDirectoryPartitioning](
-            pyarrow_unwrap_schema(schema)
+    def __init__(self, Schema schema not None, dictionaries=None):

Review comment:
       will do







[GitHub] [arrow] bkietz commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555771737



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -562,6 +569,8 @@ inline Result<std::shared_ptr<Array>> CountsToOffsets(
 // since no Writers accept a selection vector.
 class StructDictionary {
  public:
+  static constexpr int32_t kMaxGroups = std::numeric_limits<int16_t>::max();

Review comment:
       I'll add max groups as a member of WriteOptions







[GitHub] [arrow] jorisvandenbossche commented on pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#issuecomment-760886156


   I opened https://issues.apache.org/jira/browse/ARROW-11260 for the "require dictionaries or not" question





[GitHub] [arrow] pitrou commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r557408484



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1416,6 +1435,11 @@ cdef class DirectoryPartitioning(Partitioning):
     ----------
     schema : Schema
         The schema that describes the partitions present in the file path.
+    dictionaries : List[Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.

Review comment:
       This seems a bit weird and inconvenient as an API. Why not accept a `Dict[str, Array]` mapping field names to dictionaries?







[GitHub] [arrow] bkietz commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555770569



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -288,16 +288,58 @@ class ARROW_DS_EXPORT PartitioningOrFactory {
 /// \brief Assemble lists of indices of identical rows.
 ///
 /// \param[in] by A StructArray whose columns will be used as grouping criteria.
-/// \return A StructArray mapping unique rows (in field "values", represented as a
-///         StructArray with the same fields as `by`) to lists of indices where
-///         that row appears (in field "groupings").
+///               Top level nulls are invalid, as are empty criteria (no grouping
+///               columns).
+/// \return An array of type `struct<values: by.type, groupings: list<int64>>`,
+///         which is a mapping from unique rows (field "values") to lists of
+///         indices into `by` where that row appears (field "groupings").
+///
+/// For example,
+///   MakeGroupings([
+///       {"a": "ex",  "b": 0},
+///       {"a": "ex",  "b": 0},
+///       {"a": "why", "b": 0},
+///       {"a": "why", "b": 0},
+///       {"a": "ex",  "b": 0},
+///       {"a": "why", "b": 1}
+///   ]) == [
+///       {"values": {"a": "ex",  "b": 0}, "groupings": [0, 1, 4]},
+///       {"values": {"a": "why", "b": 0}, "groupings": [2, 3]},
+///       {"values": {"a": "why", "b": 1}, "groupings": [5]}
+///   ]
 ARROW_DS_EXPORT
 Result<std::shared_ptr<StructArray>> MakeGroupings(const StructArray& by);
 
-/// \brief Produce slices of an Array which correspond to the provided groupings.
+/// \brief Produce a ListArray whose slots are selections of `array` which correspond to
+/// the provided groupings.
+///
+/// For example,
+///   ApplyGroupings([[0, 1, 4], [2, 3], [5]], [
+///       {"a": "ex",  "b": 0, "passenger": 0},
+///       {"a": "ex",  "b": 0, "passenger": 1},
+///       {"a": "why", "b": 0, "passenger": 2},
+///       {"a": "why", "b": 0, "passenger": 3},
+///       {"a": "ex",  "b": 0, "passenger": 4},
+///       {"a": "why", "b": 1, "passenger": 5}
+///   ]) == [
+///     [
+///       {"a": "ex",  "b": 0, "passenger": 0},
+///       {"a": "ex",  "b": 0, "passenger": 1},
+///       {"a": "ex",  "b": 0, "passenger": 4},
+///     ],
+///     [
+///       {"a": "why", "b": 0, "passenger": 2},
+///       {"a": "why", "b": 0, "passenger": 3},
+///     ],
+///     [
+///       {"a": "why", "b": 1, "passenger": 5}
+///     ]
+///   ]
 ARROW_DS_EXPORT
 Result<std::shared_ptr<ListArray>> ApplyGroupings(const ListArray& groupings,
                                                   const Array& array);
+
+/// \brief Produce slices of a RecordBatch which correspond to the provided groupings.

Review comment:
       will do







[GitHub] [arrow] bkietz closed pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz closed pull request #9130:
URL: https://github.com/apache/arrow/pull/9130


   








[GitHub] [arrow] bkietz commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r557519605



##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -313,12 +313,16 @@ Status FileSystemDataset::Write(const FileSystemDatasetWriteOptions& write_optio
         ARROW_ASSIGN_OR_RAISE(auto groups, write_options.partitioning->Partition(batch));
         batch.reset();  // drop to hopefully conserve memory
 
+        if (static_cast<int>(groups.batches.size()) > write_options.max_partitions) {

Review comment:
       will do







[GitHub] [arrow] pitrou commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555279313



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -562,6 +569,8 @@ inline Result<std::shared_ptr<Array>> CountsToOffsets(
 // since no Writers accept a selection vector.
 class StructDictionary {
  public:
+  static constexpr int32_t kMaxGroups = std::numeric_limits<int16_t>::max();

Review comment:
       Then it's definitely worth having a reasonably small configurable limit (such as 100). I suspect it's easy to end up with Arrow creating a million files if you make a mistake when choosing your partition columns.







[GitHub] [arrow] pitrou commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555126282



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -230,9 +236,9 @@ std::vector<KeyValuePartitioning::Key> DirectoryPartitioning::ParseKeys(
   return keys;
 }
 
-inline util::optional<int> NextValid(const std::vector<Scalar*>& values, int first_null) {
+inline util::optional<int> NextValid(const ScalarVector& values, int first_null) {
   auto it = std::find_if(values.begin() + first_null + 1, values.end(),
-                         [](Scalar* v) { return v != nullptr; });
+                         [](const std::shared_ptr<Scalar> v) { return v != nullptr; });

Review comment:
       `const&`?

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -562,6 +569,8 @@ inline Result<std::shared_ptr<Array>> CountsToOffsets(
 // since no Writers accept a selection vector.
 class StructDictionary {
  public:
+  static constexpr int32_t kMaxGroups = std::numeric_limits<int16_t>::max();

Review comment:
       Can you comment on this? It's not obvious why we're limited by the size of an `int16_t`.

##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -288,16 +288,58 @@ class ARROW_DS_EXPORT PartitioningOrFactory {
 /// \brief Assemble lists of indices of identical rows.
 ///
 /// \param[in] by A StructArray whose columns will be used as grouping criteria.
-/// \return A StructArray mapping unique rows (in field "values", represented as a
-///         StructArray with the same fields as `by`) to lists of indices where
-///         that row appears (in field "groupings").
+///               Top level nulls are invalid, as are empty criteria (no grouping
+///               columns).
+/// \return An array of type `struct<values: by.type, groupings: list<int64>>`,
+///         which is a mapping from unique rows (field "values") to lists of
+///         indices into `by` where that row appears (field "groupings").
+///
+/// For example,
+///   MakeGroupings([
+///       {"a": "ex",  "b": 0},
+///       {"a": "ex",  "b": 0},
+///       {"a": "why", "b": 0},
+///       {"a": "why", "b": 0},
+///       {"a": "ex",  "b": 0},
+///       {"a": "why", "b": 1}
+///   ]) == [
+///       {"values": {"a": "ex",  "b": 0}, "groupings": [0, 1, 4]},
+///       {"values": {"a": "why", "b": 0}, "groupings": [2, 3]},
+///       {"values": {"a": "why", "b": 1}, "groupings": [5]}
+///   ]
 ARROW_DS_EXPORT
 Result<std::shared_ptr<StructArray>> MakeGroupings(const StructArray& by);
 
-/// \brief Produce slices of an Array which correspond to the provided groupings.
+/// \brief Produce a ListArray whose slots are slices of `array` which correspond to the

Review comment:
       Perhaps unimportant, but at least in Python a slice is a contiguous subsequence. That said, the example dissipates the ambiguity.







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555273007



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -562,6 +569,8 @@ inline Result<std::shared_ptr<Array>> CountsToOffsets(
 // since no Writers accept a selection vector.
 class StructDictionary {
  public:
+  static constexpr int32_t kMaxGroups = std::numeric_limits<int16_t>::max();

Review comment:
       Yes, at least one file for each group







[GitHub] [arrow] jorisvandenbossche commented on pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#issuecomment-760801124


   Since both @pitrou and I got a bit confused about this `dictionaries`, we should maybe try to further clarify the documentation around it. A more explicit test could maybe also help (although I didn't check the C++ tests).
   (This can be a follow-up, since I think this needs to get in for the release?)
   
   As I understand it now:
   
   - It's only needed in `ds.dataset(..)` when passing a schema, i.e. when creating an actual Partitioning object rather than a PartitioningFactory (which will infer the schema (and the dictionary values) from the file paths)
   - In addition, it only needs to be specified when reading, not when writing with a Partitioning (so one can create and use a schema-based Partitioning object without specifying the `dictionaries`).
     This is the "only required when parsing paths" behaviour (the docstring says "or an error will be raised in parsing"), since we don't need to parse paths when writing. But for a user it might not be clear that, in practice, "parsing paths" == "reading". A sketch contrasting the two cases follows below.
   
   This behaviour of requiring explicit dictionaries when reading a dataset with a Partitioning object whose schema includes dictionary fields already exists in 1.0 and 2.0 (only without any way to get around the error "No dictionary provided for dictionary field", except by letting the partitioning be discovered instead of specifying a schema). So that's certainly fine for 3.0.0 as well.
   
   But I am personally still wondering if we can't allow those dictionaries to be left unspecified and discovered for reading as well, even when specifying an explicit schema (e.g. it allows having mixed dictionary / non-dictionary partition fields). This actually also worked in pyarrow 0.17.0 (I added a test about that in the PR fixing it (https://github.com/apache/arrow/pull/6641#issuecomment-600746259), but that apparently got lost in a rebase ;)), but I suppose this was changed after ensuring that the dictionary-typed partition fields "know" the full dictionary of all possible values in the dataset (https://github.com/apache/arrow/pull/7536#issuecomment-649500017).
   I can open a JIRA about this to discuss further.
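   
   To make that concrete, a hedged sketch of the behaviour described above (illustrative paths and names, not the actual test code):
   
       import pyarrow as pa
       import pyarrow.dataset as ds
       
       table = pa.table(
           {"n": [1, 2], "part": pa.array(["a", "b"]).dictionary_encode()})
       part_field = table.schema.field("part")
       
       # Writing with a directly constructed Partitioning succeeds even
       # without `dictionaries`: no paths need to be parsed on write
       partitioning = ds.partitioning(pa.schema([part_field]))
       ds.write_dataset(table, "target-dir", format="feather",
                        partitioning=partitioning)
       
       # Reading with that same Partitioning must parse paths, so it
       # raises "No dictionary provided for dictionary field part"
       # unless the dictionaries are supplied
       ds.dataset("target-dir", partitioning=partitioning)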








[GitHub] [arrow] bkietz commented on pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#issuecomment-760312991


   @pitrou The dictionaries are [a feature inspired by ParquetDataset](https://github.com/apache/arrow/pull/7536#issuecomment-649500017): it's useful for each partition expression to contain the dictionary of all unique values that field could take. They are only required when parsing paths. When constructing a Partitioning from a factory (inferring fields from a vector of paths), the dictionaries are assembled automatically. However, if the Partitioning is constructed directly, then the dictionaries must be explicitly specified. A short sketch of the two construction paths follows below.
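   
   (Hedged illustration: the directory name is hypothetical, and the `dictionaries` argument is shown in the `Dict[str, Array]` form adopted later in the review.)
   
       import pyarrow as pa
       import pyarrow.dataset as ds
       
       # Factory: fields and dictionary values are inferred from paths
       dataset = ds.dataset(
           "dataset-dir",
           partitioning=ds.HivePartitioning.discover(infer_dictionary=True))
       
       # Direct construction: the dictionary must be supplied explicitly
       schema = pa.schema(
           [pa.field("part", pa.dictionary(pa.int32(), pa.string()))])
       partitioning = ds.HivePartitioning(
           schema, dictionaries={"part": pa.array(["a", "b"])})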
   
   @jorisvandenbossche I'll add a binding for `max_partitions` to python
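   
   A sketch of what that binding could look like from the user side (an assumption: that `max_partitions` becomes a keyword of `ds.write_dataset`, mirroring `FileSystemDatasetWriteOptions::max_partitions`):
   
       # Hypothetical keyword; raise the cap when a partition column
       # legitimately has many distinct values
       ds.write_dataset(table, "target-dir", format="feather",
                        partitioning=partitioning,
                        max_partitions=4096)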





[GitHub] [arrow] bkietz commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555176556



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -562,6 +569,8 @@ inline Result<std::shared_ptr<Array>> CountsToOffsets(
 // since no Writers accept a selection vector.
 class StructDictionary {
  public:
+  static constexpr int32_t kMaxGroups = std::numeric_limits<int16_t>::max();

Review comment:
       I picked it arbitrarily, to be honest. A huge number of groups seemed likely to indicate an error, so I set a limit to see who would ask about it. Should we instead allow the full range of int32? @jorisvandenbossche 







[GitHub] [arrow] bkietz commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r557520742



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1403,6 +1403,25 @@ cdef class PartitioningFactory(_Weakrefable):
         return self.wrapped
 
 
+cdef vector[shared_ptr[CArray]] _partitioning_dictionaries(
+        Schema schema, dictionaries) except *:
+    cdef:
+        vector[shared_ptr[CArray]] c_dictionaries
+
+    dictionaries = list(dictionaries or [])[:len(schema)]
+    while len(dictionaries) < len(schema):
+        dictionaries.append(None)
+
+    for field, dictionary in zip(schema, dictionaries):
+        if (isinstance(field.type, pa.DictionaryType) and
+                dictionary is not None):
+            c_dictionaries.push_back(pyarrow_unwrap_array(dictionary))
+        else:
+            c_dictionaries.push_back(<shared_ptr[CArray]> nullptr)

Review comment:
       I'll adopt the `Dict[str, Array]` pattern, which will remove this discrepancy from the python interface.
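       
       A hedged sketch of the resulting interface (field name and values are illustrative):
       
           partitioning = ds.partitioning(
               pa.schema(
                   [pa.field("year", pa.dictionary(pa.int32(), pa.string()))]),
               # keys are field names, so list ordering/padding goes away
               dictionaries={"year": pa.array(["2019", "2020", "2021"])},
           )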







[GitHub] [arrow] bkietz commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555181548



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -562,6 +569,8 @@ inline Result<std::shared_ptr<Array>> CountsToOffsets(
 // since no Writers accept a selection vector.
 class StructDictionary {
  public:
+  static constexpr int32_t kMaxGroups = std::numeric_limits<int16_t>::max();

Review comment:
       For now I'll remove the constant `kMaxGroups` and allow any int32 group count.







[GitHub] [arrow] bkietz commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555173519



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -230,9 +236,9 @@ std::vector<KeyValuePartitioning::Key> DirectoryPartitioning::ParseKeys(
   return keys;
 }
 
-inline util::optional<int> NextValid(const std::vector<Scalar*>& values, int first_null) {
+inline util::optional<int> NextValid(const ScalarVector& values, int first_null) {
   auto it = std::find_if(values.begin() + first_null + 1, values.end(),
-                         [](Scalar* v) { return v != nullptr; });
+                         [](const std::shared_ptr<Scalar> v) { return v != nullptr; });

Review comment:
       yes, thanks







[GitHub] [arrow] bkietz commented on pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#issuecomment-756372363


   Non-Apache CI: https://github.com/bkietz/arrow/runs/1664709650





[GitHub] [arrow] bkietz commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555177802



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -288,16 +288,58 @@ class ARROW_DS_EXPORT PartitioningOrFactory {
 /// \brief Assemble lists of indices of identical rows.
 ///
 /// \param[in] by A StructArray whose columns will be used as grouping criteria.
-/// \return A StructArray mapping unique rows (in field "values", represented as a
-///         StructArray with the same fields as `by`) to lists of indices where
-///         that row appears (in field "groupings").
+///               Top level nulls are invalid, as are empty criteria (no grouping
+///               columns).
+/// \return An array of type `struct<values: by.type, groupings: list<int64>>`,
+///         which is a mapping from unique rows (field "values") to lists of
+///         indices into `by` where that row appears (field "groupings").
+///
+/// For example,
+///   MakeGroupings([
+///       {"a": "ex",  "b": 0},
+///       {"a": "ex",  "b": 0},
+///       {"a": "why", "b": 0},
+///       {"a": "why", "b": 0},
+///       {"a": "ex",  "b": 0},
+///       {"a": "why", "b": 1}
+///   ]) == [
+///       {"values": {"a": "ex",  "b": 0}, "groupings": [0, 1, 4]},
+///       {"values": {"a": "why", "b": 0}, "groupings": [2, 3]},
+///       {"values": {"a": "why", "b": 1}, "groupings": [5]}
+///   ]
 ARROW_DS_EXPORT
 Result<std::shared_ptr<StructArray>> MakeGroupings(const StructArray& by);
 
-/// \brief Produce slices of an Array which correspond to the provided groupings.
+/// \brief Produce a ListArray whose slots are slices of `array` which correspond to the

Review comment:
       I'll use "selections" instead







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555285509



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -562,6 +569,8 @@ inline Result<std::shared_ptr<Array>> CountsToOffsets(
 // since no Writers accept a selection vector.
 class StructDictionary {
  public:
+  static constexpr int32_t kMaxGroups = std::numeric_limits<int16_t>::max();

Review comment:
       As long as it is configurable, that is fine for me.
   But I think something like 100 is too small. For example, the NYC taxi dataset partitioned by year + month for 8 years of data already has 8 * 12 = 96 groups. And I think partitioning by day is not that uncommon in practice for big data (although in those cases you will probably not write it all at once).







[GitHub] [arrow] pitrou commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555271701



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -562,6 +569,8 @@ inline Result<std::shared_ptr<Array>> CountsToOffsets(
 // since no Writers accept a selection vector.
 class StructDictionary {
  public:
+  static constexpr int32_t kMaxGroups = std::numeric_limits<int16_t>::max();

Review comment:
       I don't know, is a separate file created for each group? If so, it makes sense to put a configurable limit.







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r558189535



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1403,6 +1403,25 @@ cdef class PartitioningFactory(_Weakrefable):
         return self.wrapped
 
 
+cdef vector[shared_ptr[CArray]] _partitioning_dictionaries(
+        Schema schema, dictionaries) except *:
+    cdef:
+        vector[shared_ptr[CArray]] c_dictionaries
+
+    dictionaries = list(dictionaries or [])[:len(schema)]
+    while len(dictionaries) < len(schema):
+        dictionaries.append(None)
+
+    for field, dictionary in zip(schema, dictionaries):
+        if (isinstance(field.type, pa.DictionaryType) and
+                dictionary is not None):
+            c_dictionaries.push_back(pyarrow_unwrap_array(dictionary))
+        else:
+            c_dictionaries.push_back(<shared_ptr[CArray]> nullptr)

Review comment:
       The entry in `dictionaries` is still ignored when the field is not of dictionary type. That is a user error of course, but we could maybe raise an exception in that case instead of silently ignoring it.
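       
       A pure-Python sketch of the check suggested here (hedged: the real code is Cython, and the helper name is illustrative):
       
           import pyarrow as pa
           
           def _validate_dictionaries(schema, dictionaries):
               for field, dictionary in zip(schema, dictionaries):
                   if dictionary is None:
                       continue
                   if not pa.types.is_dictionary(field.type):
                       # raise instead of silently dropping the entry
                       raise ValueError(
                           "dictionary provided for non-dictionary field "
                           f"{field.name!r} of type {field.type}")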







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555275756



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -288,16 +288,58 @@ class ARROW_DS_EXPORT PartitioningOrFactory {
 /// \brief Assemble lists of indices of identical rows.
 ///
 /// \param[in] by A StructArray whose columns will be used as grouping criteria.
-/// \return A StructArray mapping unique rows (in field "values", represented as a
-///         StructArray with the same fields as `by`) to lists of indices where
-///         that row appears (in field "groupings").
+///               Top level nulls are invalid, as are empty criteria (no grouping
+///               columns).
+/// \return An array of type `struct<values: by.type, groupings: list<int64>>`,
+///         which is a mapping from unique rows (field "values") to lists of
+///         indices into `by` where that row appears (field "groupings").
+///
+/// For example,
+///   MakeGroupings([
+///       {"a": "ex",  "b": 0},
+///       {"a": "ex",  "b": 0},
+///       {"a": "why", "b": 0},
+///       {"a": "why", "b": 0},
+///       {"a": "ex",  "b": 0},
+///       {"a": "why", "b": 1}
+///   ]) == [
+///       {"values": {"a": "ex",  "b": 0}, "groupings": [0, 1, 4]},
+///       {"values": {"a": "why", "b": 0}, "groupings": [2, 3]},
+///       {"values": {"a": "why", "b": 1}, "groupings": [5]}
+///   ]

Review comment:
       Nice docstrings! Those examples help a lot

##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -288,16 +288,58 @@ class ARROW_DS_EXPORT PartitioningOrFactory {
 /// \brief Assemble lists of indices of identical rows.
 ///
 /// \param[in] by A StructArray whose columns will be used as grouping criteria.
-/// \return A StructArray mapping unique rows (in field "values", represented as a
-///         StructArray with the same fields as `by`) to lists of indices where
-///         that row appears (in field "groupings").
+///               Top level nulls are invalid, as are empty criteria (no grouping
+///               columns).
+/// \return A array of type `struct<values: by.type, groupings: list<int64>>`,
+///         which is a mapping from unique rows (field "values") to lists of
+///         indices into `by` where that row appears (field "groupings").
+///
+/// For example,
+///   MakeGroupings([
+///       {"a": "ex",  "b": 0},
+///       {"a": "ex",  "b": 0},
+///       {"a": "why", "b": 0},
+///       {"a": "why", "b": 0},
+///       {"a": "ex",  "b": 0},
+///       {"a": "why", "b": 1}
+///   ]) == [
+///       {"values": {"a": "ex",  "b": 0}, "groupings": [0, 1, 4]},
+///       {"values": {"a": "why", "b": 0}, "groupings": [2, 3]},
+///       {"values": {"a": "why", "b": 1}, "groupings": [5]}
+///   ]
 ARROW_DS_EXPORT
 Result<std::shared_ptr<StructArray>> MakeGroupings(const StructArray& by);
 
-/// \brief Produce slices of an Array which correspond to the provided groupings.
+/// \brief Produce a ListArray whose slots are selections of `array` which correspond to
+/// the provided groupings.
+///
+/// For example,
+///   ApplyGroupings([[0, 1, 4], [2, 3], [5]], [
+///       {"a": "ex",  "b": 0, "passenger": 0},
+///       {"a": "ex",  "b": 0, "passenger": 1},
+///       {"a": "why", "b": 0, "passenger": 2},
+///       {"a": "why", "b": 0, "passenger": 3},
+///       {"a": "ex",  "b": 0, "passenger": 4},
+///       {"a": "why", "b": 1, "passenger": 5}
+///   ]) == [
+///     [
+///       {"a": "ex",  "b": 0, "passenger": 0},
+///       {"a": "ex",  "b": 0, "passenger": 1},
+///       {"a": "ex",  "b": 0, "passenger": 4},
+///     ],
+///     [
+///       {"a": "why", "b": 0, "passenger": 2},
+///       {"a": "why", "b": 0, "passenger": 3},
+///     ],
+///     [
+///       {"a": "why", "b": 1, "passenger": 5}
+///     ]
+///   ]
 ARROW_DS_EXPORT
 Result<std::shared_ptr<ListArray>> ApplyGroupings(const ListArray& groupings,
                                                   const Array& array);
+
+/// \brief Produce slices of a RecordBatch which correspond to the provided groupings.

Review comment:
       Also here "slices" -> "selections" ?

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1433,12 +1433,19 @@ cdef class DirectoryPartitioning(Partitioning):
     cdef:
         CDirectoryPartitioning* directory_partitioning
 
-    def __init__(self, Schema schema not None):
-        cdef shared_ptr[CDirectoryPartitioning] partitioning
-        partitioning = make_shared[CDirectoryPartitioning](
-            pyarrow_unwrap_schema(schema)
+    def __init__(self, Schema schema not None, dictionaries=None):

Review comment:
       Can you update the above docstring for this new keyword? (and same for HivePartitioning, and partitioning() below) 

##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -2315,6 +2315,29 @@ def test_write_dataset_partitioned(tempdir):
         partitioning=partitioning_schema)
 
 
+@pytest.mark.parquet
+@pytest.mark.pandas
+def test_write_dataset_partitioned_dict(tempdir):
+    directory = tempdir / "partitioned"
+    _ = _create_parquet_dataset_partitioned(directory)
+
+    # directory partitioning, dictionary partition columns
+    dataset = ds.dataset(
+        directory,
+        partitioning=ds.HivePartitioning.discover(infer_dictionary=True))
+    target = tempdir / 'partitioned-dir-target'
+    expected_paths = [
+        target / "a", target / "a" / "part-0.feather",
+        target / "b", target / "b" / "part-1.feather"
+    ]
+    partitioning_schema = ds.partitioning(pa.schema([
+        dataset.schema.field('part')]),
+        dictionaries=[pa.array(['a', 'b'])])

Review comment:
       Even more surprisingly, it doesn't seem to impact the result if I set values there that differ from the actual values in the partition column.

##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -2315,6 +2315,29 @@ def test_write_dataset_partitioned(tempdir):
         partitioning=partitioning_schema)
 
 
+@pytest.mark.parquet
+@pytest.mark.pandas
+def test_write_dataset_partitioned_dict(tempdir):
+    directory = tempdir / "partitioned"
+    _ = _create_parquet_dataset_partitioned(directory)
+
+    # directory partitioning, dictionary partition columns
+    dataset = ds.dataset(
+        directory,
+        partitioning=ds.HivePartitioning.discover(infer_dictionary=True))
+    target = tempdir / 'partitioned-dir-target'
+    expected_paths = [
+        target / "a", target / "a" / "part-0.feather",
+        target / "b", target / "b" / "part-1.feather"
+    ]
+    partitioning_schema = ds.partitioning(pa.schema([
+        dataset.schema.field('part')]),
+        dictionaries=[pa.array(['a', 'b'])])

Review comment:
       Are you required to pass those dictionaries for writing? 
   Based on a quick test locally, it seems to work without them as well?







[GitHub] [arrow] bkietz commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555785945



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -2315,6 +2315,29 @@ def test_write_dataset_partitioned(tempdir):
         partitioning=partitioning_schema)
 
 
+@pytest.mark.parquet
+@pytest.mark.pandas
+def test_write_dataset_partitioned_dict(tempdir):
+    directory = tempdir / "partitioned"
+    _ = _create_parquet_dataset_partitioned(directory)
+
+    # directory partitioning, dictionary partition columns
+    dataset = ds.dataset(
+        directory,
+        partitioning=ds.HivePartitioning.discover(infer_dictionary=True))
+    target = tempdir / 'partitioned-dir-target'
+    expected_paths = [
+        target / "a", target / "a" / "part-0.feather",
+        target / "b", target / "b" / "part-1.feather"
+    ]
+    partitioning_schema = ds.partitioning(pa.schema([
+        dataset.schema.field('part')]),
+        dictionaries=[pa.array(['a', 'b'])])

Review comment:
       That's surprising; you should see errors: `No dictionary provided for dictionary field part: dictionary<values=string, indices=int32, ordered=0>` if you don't specify a dictionary, and `Dictionary supplied for field part: dictionary<values=string, indices=int32, ordered=0> does not contain 'a'` if you specify a dictionary which doesn't include all the column's values.
   







[GitHub] [arrow] pitrou commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r557407889



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1403,6 +1403,25 @@ cdef class PartitioningFactory(_Weakrefable):
         return self.wrapped
 
 
+cdef vector[shared_ptr[CArray]] _partitioning_dictionaries(
+        Schema schema, dictionaries) except *:
+    cdef:
+        vector[shared_ptr[CArray]] c_dictionaries
+
+    dictionaries = list(dictionaries or [])[:len(schema)]
+    while len(dictionaries) < len(schema):
+        dictionaries.append(None)
+
+    for field, dictionary in zip(schema, dictionaries):
+        if (isinstance(field.type, pa.DictionaryType) and
+                dictionary is not None):
+            c_dictionaries.push_back(pyarrow_unwrap_array(dictionary))
+        else:
+            c_dictionaries.push_back(<shared_ptr[CArray]> nullptr)

Review comment:
       `dictionary` is ignored here? That doesn't sound right.










[GitHub] [arrow] pitrou commented on pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#issuecomment-760211107


   While it's not a blocker, I admit I don't understand why it's necessary to pass dictionary values, and in which case (only for writing?).

