Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/26 03:37:44 UTC

[GitHub] [arrow] westonpace opened a new pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

westonpace opened a new pull request #9323:
URL: https://github.com/apache/arrow/pull/9323


   *This is a draft; right now only the lower-level compute/partitioning parts are done.*
   
   Tested and added support for partitioning with nulls.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r576476902



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file():
 
 @pytest.mark.parquet
 @pytest.mark.parametrize('partitioning', ["directory", "hive"])
+@pytest.mark.parametrize('null_fallback', ['xyz', None])
 @pytest.mark.parametrize('partition_keys', [
     (["A", "B", "C"], [1, 2, 3]),
     ([1, 2, 3], ["A", "B", "C"]),
     (["A", "B", "C"], ["D", "E", "F"]),
     ([1, 2, 3], [4, 5, 6]),
+    ([1, None, 3], ["A", "B", "C"]),
+    ([1, 2, 3], ["A", None, "C"]),
+    ([None, 2, 3], [None, 2, 3]),
 ])
-def test_open_dataset_partitioned_dictionary_type(tempdir, partitioning,
-                                                  partition_keys):
+def test_open_dataset_partitioned_dictionary_type(

Review comment:
       I added a parameter so this test now covers both inferred-dictionary and plain (non-dictionary) reading.  I renamed it to `test_partition_discovery` since the test case now covers several methods of discovery.  I've also added tests for writing and a few other situations.







[GitHub] [arrow] bkietz commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r581079314



##########
File path: cpp/src/arrow/dataset/expression.cc
##########
@@ -698,16 +701,25 @@ Status ExtractKnownFieldValuesImpl(
                          return !(ref && lit);
                        }
 
+                       if (call->function_name == "is_null") {
+                         auto ref = call->arguments[0].field_ref();
+                         return !ref;
+                       }
+
                        return true;
                      });
 
   for (auto it = unconsumed_end; it != conjunction_members->end(); ++it) {
     auto call = CallNotNull(*it);
 
-    auto ref = call->arguments[0].field_ref();
-    auto lit = call->arguments[1].literal();
-
-    known_values->emplace(*ref, *lit);
+    if (call->function_name == "equal") {
+      auto ref = call->arguments[0].field_ref();
+      auto lit = call->arguments[1].literal();
+      known_values->emplace(*ref, *lit);
+    } else if (call->function_name == "is_null") {
+      auto ref = call->arguments[0].field_ref();
+      known_values->emplace(*ref, std::make_shared<NullScalar>());

Review comment:
       It looks like gcc 4.8's `emplace` isn't clever enough to figure this out: https://github.com/apache/arrow/pull/9323/checks?check_run_id=1957359093#step:9:1151
   ```suggestion
         known_values->emplace(*ref, Datum(std::make_shared<NullScalar>()));
   ```
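   A minimal sketch of the kind of conversion that trips older standard libraries (the `Base`/`Derived`/`Value` types are simplified stand-ins, not Arrow's): constructing the mapped type requires two user-defined conversions, which GCC 4.8's `pair`/`emplace` machinery rejects, while spelling the conversion out compiles everywhere.
   ```cpp
   #include <map>
   #include <memory>
   #include <string>

   struct Base { virtual ~Base() = default; };
   struct Derived : Base {};

   struct Value {
     // Converting constructor, analogous to Datum(std::shared_ptr<Scalar>)
     Value(std::shared_ptr<Base> p) : ptr(std::move(p)) {}
     std::shared_ptr<Base> ptr;
   };

   int main() {
     std::map<std::string, Value> known;
     // known.emplace("a", std::make_shared<Derived>());       // brittle on GCC 4.8
     known.emplace("a", Value(std::make_shared<Derived>()));   // portable spelling
     return 0;
   }
   ```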







[GitHub] [arrow] bkietz closed pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
bkietz closed pull request #9323:
URL: https://github.com/apache/arrow/pull/9323


   





[GitHub] [arrow] bkietz commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r565630529



##########
File path: cpp/src/arrow/compute/api_vector.h
##########
@@ -63,6 +63,25 @@ enum class SortOrder {
   Descending,
 };
 
+struct DictionaryEncodeOptions : public FunctionOptions {
+  /// Configure how null values will be encoded
+  enum NullEncodingBehavior {
+    /// the null value will be added to the dictionary with a proper index
+    ENCODE,
+    /// the null value will be masked in the indices array
+    MASK,
+    /// the null value will not be included in the dictionary
+    SKIP

Review comment:
       That would work. Here's an example from [filter](https://github.com/bkietz/arrow/blob/bead1db7ab5978331ee2622d1cca0911ae939c52/cpp/src/arrow/compute/api_vector.h#L109-L112)







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r568264814



##########
File path: cpp/src/arrow/compute/api_vector.h
##########
@@ -63,6 +63,25 @@ enum class SortOrder {
   Descending,
 };
 
+struct DictionaryEncodeOptions : public FunctionOptions {
+  /// Configure how null values will be encoded
+  enum NullEncodingBehavior {
+    /// the null value will be added to the dictionary with a proper index
+    ENCODE,
+    /// the null value will be masked in the indices array
+    MASK,
+    /// the null value will not be included in the dictionary
+    SKIP

Review comment:
       I removed SKIP and added some comments similar to those in filter.







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r568264655



##########
File path: cpp/src/arrow/pretty_print.cc
##########
@@ -670,4 +670,49 @@ Status PrettyPrint(const Schema& schema, const PrettyPrintOptions& options,
   return Status::OK();
 }
 
+void GdbPrintArray(const Array& arr, int indent) {

Review comment:
       Removed.







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r564198742



##########
File path: cpp/src/arrow/pretty_print.cc
##########
@@ -670,4 +670,49 @@ Status PrettyPrint(const Schema& schema, const PrettyPrintOptions& options,
   return Status::OK();
 }
 
+void GdbPrintArray(const Array& arr, int indent) {

Review comment:
       These changes aren't related to the PR and I can pull them out but I'd like to get them in at some point.  The GDB pretty prints (python scripts) aren't included in this PR but I use them for debugging and they rely on these functions.







[GitHub] [arrow] github-actions[bot] commented on pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#issuecomment-767274981


   https://issues.apache.org/jira/browse/ARROW-10438





[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r565550260



##########
File path: cpp/src/arrow/compute/api_vector.h
##########
@@ -63,6 +63,25 @@ enum class SortOrder {
   Descending,
 };
 
+struct DictionaryEncodeOptions : public FunctionOptions {
+  /// Configure how null values will be encoded
+  enum NullEncodingBehavior {
+    /// the null value will be added to the dictionary with a proper index
+    ENCODE,
+    /// the null value will be masked in the indices array
+    MASK,
+    /// the null value will not be included in the dictionary
+    SKIP

Review comment:
       Yes, sorry, I will discard the SKIP/IGNORE option.  At one point I thought that was the current behavior and so I added the option.  Later testing showed that I simply misunderstood what was happening and I forgot to fully remove the option.  ENCODE and MASK are the only options.
   
   @bkietz I'm not sure what inline examples of such behavior would mean.  Something like this maybe...
   
   ```
   ENCODE [5, 5, null] -> Dictionary [5, null] / Indices [0, 0, 1]
   MASK [5, 5, null] -> Dictionary [5] / Indices [0, 0, null]
   ```
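   The same two behaviors rendered as concrete arrays (a sketch mirroring the C++ test added later in this thread; it assumes the `ArrayFromJSON` test helper used there):
   ```cpp
   auto dict_type = dictionary(int32(), uint8());

   // ENCODE: null gets its own slot in the dictionary
   auto encoded = std::make_shared<DictionaryArray>(
       dict_type, ArrayFromJSON(int32(), "[0, 0, 1]"),
       ArrayFromJSON(uint8(), "[5, null]"));

   // MASK: null appears only in the indices
   auto masked = std::make_shared<DictionaryArray>(
       dict_type, ArrayFromJSON(int32(), "[0, 0, null]"),
       ArrayFromJSON(uint8(), "[5]"));
   ```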
   









[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r575328856



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file():
 
 @pytest.mark.parquet
 @pytest.mark.parametrize('partitioning', ["directory", "hive"])
+@pytest.mark.parametrize('null_fallback', ['xyz', None])
 @pytest.mark.parametrize('partition_keys', [
     (["A", "B", "C"], [1, 2, 3]),
     ([1, 2, 3], ["A", "B", "C"]),
     (["A", "B", "C"], ["D", "E", "F"]),
     ([1, 2, 3], [4, 5, 6]),
+    ([1, None, 3], ["A", "B", "C"]),
+    ([1, 2, 3], ["A", None, "C"]),
+    ([None, 2, 3], [None, 2, 3]),
 ])
-def test_open_dataset_partitioned_dictionary_type(tempdir, partitioning,
-                                                  partition_keys):
+def test_open_dataset_partitioned_dictionary_type(
+    tempdir, partitioning, null_fallback, partition_keys
+):
     # ARROW-9288 / ARROW-9476
     import pyarrow.parquet as pq
-    table = pa.table({'a': range(9), 'b': [0.] * 4 + [1.] * 5})
+
+    table = pa.table({'a': range(9), 'b': [0.0] * 4 + [1.0] * 5})
+
+    if None in partition_keys[0] or None in partition_keys[1]:
+        # Directory partitioning can't handle the first part being null
+        return

Review comment:
       only return here if `partitioning == "directory"` ?

##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file():
 
 @pytest.mark.parquet
 @pytest.mark.parametrize('partitioning', ["directory", "hive"])
+@pytest.mark.parametrize('null_fallback', ['xyz', None])

Review comment:
       What does `null_fallback=None` mean? (based on the docstring above it seems it can only be a string?)

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -74,15 +74,26 @@ Status KeyValuePartitioning::SetDefaultValuesFromKeys(const Expression& expr,
                                                       RecordBatchProjector* projector) {
   ARROW_ASSIGN_OR_RAISE(auto known_values, ExtractKnownFieldValues(expr));
   for (const auto& ref_value : known_values) {
-    if (!ref_value.second.is_scalar()) {
-      return Status::Invalid("non-scalar partition key ", ref_value.second.ToString());
+    const auto& known_value = ref_value.second;
+    if (known_value.concrete() && !known_value.datum.is_scalar()) {
+      return Status::Invalid("non-scalar partition key ", known_value.datum.ToString());
     }
 
     ARROW_ASSIGN_OR_RAISE(auto match,
                           ref_value.first.FindOneOrNone(*projector->schema()));
 
     if (match.empty()) continue;
-    RETURN_NOT_OK(projector->SetDefaultValue(match, ref_value.second.scalar()));
+
+    const auto& field = projector->schema()->field(match[0]);
+    if (known_value.concrete()) {
+      RETURN_NOT_OK(projector->SetDefaultValue(match, known_value.datum.scalar()));
+    } else if (known_value.valid) {
+      return Status::Invalid(
+          "Partition expression not defined enough to set default value for ",

Review comment:
       What does "not defined enough" in practice mean? (or what would be an example?)

##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file():
 
 @pytest.mark.parquet
 @pytest.mark.parametrize('partitioning', ["directory", "hive"])
+@pytest.mark.parametrize('null_fallback', ['xyz', None])
 @pytest.mark.parametrize('partition_keys', [
     (["A", "B", "C"], [1, 2, 3]),
     ([1, 2, 3], ["A", "B", "C"]),
     (["A", "B", "C"], ["D", "E", "F"]),
     ([1, 2, 3], [4, 5, 6]),
+    ([1, None, 3], ["A", "B", "C"]),
+    ([1, 2, 3], ["A", None, "C"]),
+    ([None, 2, 3], [None, 2, 3]),
 ])
-def test_open_dataset_partitioned_dictionary_type(tempdir, partitioning,
-                                                  partition_keys):
+def test_open_dataset_partitioned_dictionary_type(

Review comment:
       You added this to a test that is specifically about reading partitioned datasets while inferring the partition fields as dictionary. That's fine (this case also needs to be able to handle that), but this should also work (and so be tested) in the default case that does not infer dictionary type?
   And should we also have a test for the writing part? (this one only tests reading)







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r580699988



##########
File path: cpp/src/arrow/compute/kernels/vector_hash.cc
##########
@@ -147,6 +152,8 @@ class ValueCountsAction final : ActionBase {
     }
   }
 
+  bool ShouldEncodeNulls() { return true; }

Review comment:
       This appears to fail on GCC 6.3.  See https://github.com/apache/arrow/pull/9323/checks?check_run_id=1956850616 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66297
   
   Suggestions?







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r568266523



##########
File path: cpp/src/arrow/compute/kernels/vector_hash.cc
##########
@@ -451,8 +487,12 @@ struct HashKernelTraits<Type, Action, enable_if_has_string_view<Type>> {
 template <typename Type, typename Action>
 std::unique_ptr<HashKernel> HashInitImpl(KernelContext* ctx, const KernelInitArgs& args) {
   using HashKernelType = typename HashKernelTraits<Type, Action>::HashKernel;
-  auto result = ::arrow::internal::make_unique<HashKernelType>(args.inputs[0].type,
-                                                               ctx->memory_pool());
+  DictionaryEncodeOptions options;
+  if (auto options_ptr = static_cast<const DictionaryEncodeOptions*>(args.options)) {

Review comment:
       I moved the options decoding into Action although the `HashKernel` still needed to be aware of it (to decide if `null` should go in the dictionary or not) so it changed the kernel->action interface slightly (added `ShouldEncodeNulls`)







[GitHub] [arrow] bkietz commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r564831474



##########
File path: cpp/src/arrow/compute/kernels/vector_hash.cc
##########
@@ -451,8 +487,12 @@ struct HashKernelTraits<Type, Action, enable_if_has_string_view<Type>> {
 template <typename Type, typename Action>
 std::unique_ptr<HashKernel> HashInitImpl(KernelContext* ctx, const KernelInitArgs& args) {
   using HashKernelType = typename HashKernelTraits<Type, Action>::HashKernel;
-  auto result = ::arrow::internal::make_unique<HashKernelType>(args.inputs[0].type,
-                                                               ctx->memory_pool());
+  DictionaryEncodeOptions options;
+  if (auto options_ptr = static_cast<const DictionaryEncodeOptions*>(args.options)) {

Review comment:
       This seems fragile; if other functions which use `HashInitImpl` add options this will break. Would it be possible to branch on the Action instead?
   
   ```suggestion
     if (std::is_same<Action, DictEncodeAction>::value) {
   ```
   
   (currently dictionary_encode uses ValueCountsAction in Init, which is confusing and would invalidate the above approach)
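   For reference, a self-contained sketch of branching on the Action type at compile time (the Action structs here are empty stand-ins, not the real kernel actions):
   ```cpp
   #include <type_traits>

   struct DictEncodeAction {};
   struct ValueCountsAction {};

   // Only the dictionary-encode action cares about DictionaryEncodeOptions.
   template <typename Action>
   constexpr bool NeedsEncodeOptions() {
     return std::is_same<Action, DictEncodeAction>::value;
   }

   static_assert(NeedsEncodeOptions<DictEncodeAction>(), "reads encode options");
   static_assert(!NeedsEncodeOptions<ValueCountsAction>(), "ignores encode options");
   ```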

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -625,9 +621,17 @@ class StructDictionary {
 
  private:
   Status AddOne(Datum column, std::shared_ptr<Int32Array>* fused_indices) {
-    if (column.type()->id() != Type::DICTIONARY) {
-      ARROW_ASSIGN_OR_RAISE(column, compute::DictionaryEncode(std::move(column)));
+    if (column.type()->id() == Type::DICTIONARY) {
+      // compute::DictionaryEncode doesn't support dictionary and, even if it did, it
+      // would be a null op and return a flat dictionary.  In order to group by dictionary
+      // we would need to be able to create a nested dictionary.
+      return Status::NotImplemented(
+          "Cannot use column of type dictionary as grouping criteria");

Review comment:
       Please revert this: in case of a dictionary array, dictionary encode is unnecessary as we already have the dictionary and the indices and they can be used unchanged below.

##########
File path: cpp/src/arrow/compute/api_vector.h
##########
@@ -296,7 +315,8 @@ Result<std::shared_ptr<StructArray>> ValueCounts(const Datum& value,
 /// \since 1.0.0
 /// \note API not yet finalized
 ARROW_EXPORT
-Result<Datum> DictionaryEncode(const Datum& data, ExecContext* ctx = NULLPTR);
+Result<Datum> DictionaryEncode(const Datum& data, const DictionaryEncodeOptions& options,

Review comment:
       You could provide the default dict encode options here:
   ```suggestion
   Result<Datum> DictionaryEncode(const Datum& data,
                                  const DictionaryEncodeOptions& options = DictionaryEncodeOptions::Defaults(),
   ```
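   With the defaulted argument, existing call sites keep compiling unchanged and opting into a different null behavior is one options object away. A hypothetical call site (a sketch following the test code later in this thread, not a prescribed API):
   ```cpp
   #include "arrow/compute/api.h"

   arrow::Result<arrow::Datum> EncodeWithNullSlot(const arrow::Datum& column) {
     // Defaults() keeps the pre-existing behavior (nulls masked in the indices).
     auto options = arrow::compute::DictionaryEncodeOptions::Defaults();
     // Opt in to giving null its own dictionary slot.
     options.null_encoding_behavior = arrow::compute::DictionaryEncodeOptions::ENCODE;
     return arrow::compute::DictionaryEncode(column, options);
   }
   ```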

##########
File path: cpp/src/arrow/compute/api_vector.h
##########
@@ -63,6 +63,25 @@ enum class SortOrder {
   Descending,
 };
 
+struct DictionaryEncodeOptions : public FunctionOptions {
+  /// Configure how null values will be encoded
+  enum NullEncodingBehavior {
+    /// the null value will be added to the dictionary with a proper index
+    ENCODE,
+    /// the null value will be masked in the indices array
+    MASK,
+    /// the null value will not be included in the dictionary
+    SKIP

Review comment:
       These are confusingly worded. It'd also be good to have inline examples of each behavior
   ```suggestion
       /// nulls result in an index referencing a null slot in the dictionary
       ENCODE,
       /// nulls result in a null index (does not correspond to any slot in the dictionary)
       MASK,
       /// nulls are ignored
       IGNORE
   ```

##########
File path: cpp/src/arrow/compute/kernels/vector_hash_test.cc
##########
@@ -649,13 +692,42 @@ TEST_F(TestHashKernel, ChunkedArrayInvoke) {
 TEST_F(TestHashKernel, ZeroLengthDictionaryEncode) {
   // ARROW-7008
   auto values = ArrayFromJSON(utf8(), "[]");
-  ASSERT_OK_AND_ASSIGN(Datum datum_result, DictionaryEncode(values));
+  ASSERT_OK_AND_ASSIGN(Datum datum_result,
+                       DictionaryEncode(values, DictionaryEncodeOptions::Defaults()));
 
   std::shared_ptr<Array> result = datum_result.make_array();
   const auto& dict_result = checked_cast<const DictionaryArray&>(*result);
   ASSERT_OK(dict_result.ValidateFull());
 }
 
+TEST_F(TestHashKernel, NullEncodingSchemes) {
+  auto values = ArrayFromJSON(uint8(), "[1, 1, null, 2, null]");
+
+  // Masking should put null in the indices array
+  auto expected_mask_indices = ArrayFromJSON(int32(), "[0, 0, null, 1, null]");
+  auto expected_mask_dictionary = ArrayFromJSON(uint8(), "[1, 2]");
+  auto dictionary_type = dictionary(int32(), uint8());
+  std::shared_ptr<Array> expected = std::make_shared<DictionaryArray>(
+      dictionary_type, expected_mask_indices, expected_mask_dictionary);
+
+  ASSERT_OK_AND_ASSIGN(Datum datum_result,
+                       DictionaryEncode(values, DictionaryEncodeOptions::Defaults()));
+  std::shared_ptr<Array> result = datum_result.make_array();
+  AssertArraysEqual(*expected, *result);
+
+  // Encoding should put null in the dictionary
+  auto expected_encoded_indices = ArrayFromJSON(int32(), "[0, 0, 1, 2, 1]");
+  auto expected_encoded_dict = ArrayFromJSON(uint8(), "[1, null, 2]");
+  expected = std::make_shared<DictionaryArray>(dictionary_type, expected_encoded_indices,
+                                               expected_encoded_dict);
+
+  auto options = DictionaryEncodeOptions::Defaults();
+  options.null_encoding_behavior = DictionaryEncodeOptions::ENCODE;
+  ASSERT_OK_AND_ASSIGN(datum_result, DictionaryEncode(values, options));
+  result = datum_result.make_array();
+  AssertArraysEqual(*expected, *result);
+}

Review comment:
       It'd be acceptable to just remove `DictionaryEncodeOptions::SKIP` for now; we can add it in a follow up if there's demand

##########
File path: cpp/src/arrow/pretty_print.cc
##########
@@ -670,4 +670,49 @@ Status PrettyPrint(const Schema& schema, const PrettyPrintOptions& options,
   return Status::OK();
 }
 
+void GdbPrintArray(const Array& arr, int indent) {

Review comment:
       Thanks for working on these, but please revert them here and add them in a dedicated PR







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r580559082



##########
File path: cpp/src/arrow/dataset/partition_test.cc
##########
@@ -322,8 +397,9 @@ TEST_F(TestPartitioning, DiscoverHiveSchema) {
 }
 

Review comment:
       There was already one in dictionary inference but I added one in the non-dictionary inference case too.  This also prompted me to consider the case `{"/alpha=_FALLBACK_STRING_"}` which I've changed to be an error instead of returning an empty schema.  This matches directory partitioning.

##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -120,6 +125,7 @@ class ARROW_DS_EXPORT KeyValuePartitioning : public Partitioning {
   /// of a scalar value
   struct Key {
     std::string name, value;
+    bool null;

Review comment:
       Done.







[GitHub] [arrow] bkietz commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r581371432



##########
File path: cpp/src/arrow/dataset/partition_test.cc
##########
@@ -77,6 +78,39 @@ class TestPartitioning : public ::testing::Test {
     ASSERT_OK_AND_ASSIGN(partitioning_, factory_->Finish(actual));
   }
 
+  void AssertPartition(const std::shared_ptr<Partitioning> partitioning,
+                       const std::shared_ptr<RecordBatch> full_batch,
+                       const RecordBatchVector& expected_batches,
+                       const std::vector<Expression>& expected_expressions) {
+    ASSERT_OK_AND_ASSIGN(auto partition_results, partitioning->Partition(full_batch));
+    std::shared_ptr<RecordBatch> rest = full_batch;

Review comment:
       Unused:
   ```suggestion
   ```

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -147,7 +151,9 @@ Result<Expression> KeyValuePartitioning::ConvertKey(const Key& key) const {
 
   std::shared_ptr<Scalar> converted;
 
-  if (field->type()->id() == Type::DICTIONARY) {
+  if (!key.value.has_value()) {
+    return is_null(field_ref(field->name()));
+  } else if (field->type()->id() == Type::DICTIONARY) {

Review comment:
       nit:
   ```suggestion
     }
   
     if (field->type()->id() == Type::DICTIONARY) {
   ```

##########
File path: cpp/src/arrow/compute/kernels/vector_hash_test.cc
##########
@@ -542,6 +547,12 @@ TEST_F(TestHashKernel, UniqueDecimal) {
                                           {true, false, true, true}, expected, {1, 0, 1});
 }
 
+TEST_F(TestHashKernel, UniqueNull) {
+  CheckUnique<NullType, std::nullptr_t>(null(), {nullptr, nullptr}, {false, true},
+                                        {nullptr}, {false});
+  CheckUnique<NullType, std::nullptr_t>(null(), {}, {}, {}, {});

Review comment:
       ```suggestion
     CheckUnique(ArrayFromJSON(null(), "[null, null, null]"), ArrayFromJSON(null(), "[null]"));
     CheckUnique(ArrayFromJSON(null(), "[]"), ArrayFromJSON(null(), "[]"));
   ```

##########
File path: cpp/src/arrow/dataset/partition_test.cc
##########
@@ -282,9 +348,16 @@ TEST_F(TestPartitioning, HivePartitioningFormat) {
                     equal(field_ref("alpha"), literal(0))),
                "alpha=0/beta=3.25");
   AssertFormat(equal(field_ref("alpha"), literal(0)), "alpha=0");

Review comment:
       Since a null-valued partition key is semantically equivalent to an absent one, we should ensure they format identically. I've created ARROW-11762 as a follow-up.

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -625,8 +650,27 @@ class StructDictionary {
 
  private:
   Status AddOne(Datum column, std::shared_ptr<Int32Array>* fused_indices) {
+    if (column.type()->id() == Type::DICTIONARY) {
+      if (column.null_count() != 0) {
+        // TODO(ARROW-11732) Optimize this by allowign DictionaryEncode to transfer a

Review comment:
       ```suggestion
           // TODO(ARROW-11732) Optimize this by allowing DictionaryEncode to transfer a
   ```

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -287,8 +303,13 @@ class KeyValuePartitioningFactory : public PartitioningFactory {
     return it_inserted.first->second;
   }
 
-  Status InsertRepr(const std::string& name, util::string_view repr) {
-    return InsertRepr(GetOrInsertField(name), repr);
+  Status InsertRepr(const std::string& name, util::optional<string_view> repr) {
+    auto field_index = GetOrInsertField(name);
+    if (repr.has_value()) {
+      return InsertRepr(field_index, *repr);
+    } else {
+      return Status::OK();
+    }

Review comment:
       ```suggestion
       if (repr.has_value()) {
         auto field_index = GetOrInsertField(name);
         return InsertRepr(field_index, *repr);
       }
       return Status::OK();
   ```







[GitHub] [arrow] bkietz commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r570419335



##########
File path: cpp/src/arrow/compute/kernels/vector_hash.cc
##########
@@ -76,6 +79,8 @@ class UniqueAction final : public ActionBase {
   template <class Index>
   void ObserveNotFound(Index index) {}
 
+  bool ShouldEncodeNulls() { return true; }

Review comment:
       Making this explicitly constexpr (where possible) will help the compiler drop dead code 
   ```suggestion
     constexpr bool ShouldEncodeNulls() const { return true; }
   ```
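   An illustrative sketch of why this helps (stand-in types, not the hash kernels themselves): with a `constexpr`, `const` member the condition is a per-Action compile-time constant, so the optimizer can drop the untaken branch.
   ```cpp
   struct EncodeNullsAction {
     constexpr bool ShouldEncodeNulls() const { return true; }
   };

   template <typename Action>
   int NullSlotsToReserve(const Action& action, int null_count) {
     // For EncodeNullsAction this reduces to "return null_count" at compile time.
     return action.ShouldEncodeNulls() ? null_count : 0;
   }
   ```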







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r568264983



##########
File path: cpp/src/arrow/compute/api_vector.h
##########
@@ -296,7 +315,8 @@ Result<std::shared_ptr<StructArray>> ValueCounts(const Datum& value,
 /// \since 1.0.0
 /// \note API not yet finalized
 ARROW_EXPORT
-Result<Datum> DictionaryEncode(const Datum& data, ExecContext* ctx = NULLPTR);
+Result<Datum> DictionaryEncode(const Datum& data, const DictionaryEncodeOptions& options,

Review comment:
       Done, and cleaned up all the other call sites explicitly stating the default options.







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r576476237



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -74,15 +74,26 @@ Status KeyValuePartitioning::SetDefaultValuesFromKeys(const Expression& expr,
                                                       RecordBatchProjector* projector) {
   ARROW_ASSIGN_OR_RAISE(auto known_values, ExtractKnownFieldValues(expr));
   for (const auto& ref_value : known_values) {
-    if (!ref_value.second.is_scalar()) {
-      return Status::Invalid("non-scalar partition key ", ref_value.second.ToString());
+    const auto& known_value = ref_value.second;
+    if (known_value.concrete() && !known_value.datum.is_scalar()) {
+      return Status::Invalid("non-scalar partition key ", known_value.datum.ToString());
     }
 
     ARROW_ASSIGN_OR_RAISE(auto match,
                           ref_value.first.FindOneOrNone(*projector->schema()));
 
     if (match.empty()) continue;
-    RETURN_NOT_OK(projector->SetDefaultValue(match, ref_value.second.scalar()));
+
+    const auto& field = projector->schema()->field(match[0]);
+    if (known_value.concrete()) {
+      RETURN_NOT_OK(projector->SetDefaultValue(match, known_value.datum.scalar()));
+    } else if (known_value.valid) {
+      return Status::Invalid(
+          "Partition expression not defined enough to set default value for ",

Review comment:
       Ok, now it just continues in this case and I added a test to verify this behavior.







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r580663647



##########
File path: cpp/src/arrow/dataset/expression.h
##########
@@ -159,10 +162,27 @@ Expression call(std::string function, std::vector<Expression> arguments,
 ARROW_DS_EXPORT
 std::vector<FieldRef> FieldsInExpression(const Expression&);
 
+/// Represents either a concrete value or a hint that a field is valid/invalid
+struct KnownFieldValue {

Review comment:
       Merged in.







[GitHub] [arrow] bkietz commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r579306832



##########
File path: cpp/src/arrow/compute/kernels/vector_hash.cc
##########
@@ -529,6 +578,7 @@ std::unique_ptr<KernelState> DictionaryHashInit(KernelContext* ctx,
       DCHECK(false) << "Unsupported dictionary index type";
       break;
   }
+  DictionaryEncodeOptions options = DictionaryEncodeOptions::Defaults();

Review comment:
       Seems unused:
   ```suggestion
   ```

##########
File path: cpp/src/arrow/compute/kernels/vector_hash.cc
##########
@@ -147,6 +152,8 @@ class ValueCountsAction final : ActionBase {
     }
   }
 
+  bool ShouldEncodeNulls() { return true; }

Review comment:
       ```suggestion
     constexpr bool ShouldEncodeNulls() const { return true; }
   ```

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -625,8 +654,27 @@ class StructDictionary {
 
  private:
   Status AddOne(Datum column, std::shared_ptr<Int32Array>* fused_indices) {
+    if (column.type()->id() == Type::DICTIONARY) {
+      if (column.null_count() != 0) {
+        // TODO Optimize this by allowign DictionaryEncode to transfer a null-masked

Review comment:
       please include a username or (preferably) a JIRA in TODO comments
   ```suggestion
           // TODO($FOLLOW_UP) Optimize this by allowing DictionaryEncode to transfer a null-masked
   ```

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -410,21 +434,26 @@ std::shared_ptr<PartitioningFactory> DirectoryPartitioning::MakeFactory(
 }
 
 util::optional<KeyValuePartitioning::Key> HivePartitioning::ParseKey(
-    const std::string& segment) {
+    const std::string& segment, const std::string& null_fallback) {
   auto name_end = string_view(segment).find_first_of('=');
+  // Not round-trippable
   if (name_end == string_view::npos) {
     return util::nullopt;
   }
 
-  return Key{segment.substr(0, name_end), segment.substr(name_end + 1)};
+  auto value = segment.substr(name_end + 1);
+  if (value == null_fallback) {
+    return Key{segment.substr(0, name_end), "", true};
+  }
+  return Key{segment.substr(0, name_end), segment.substr(name_end + 1), false};

Review comment:
       ```suggestion
     auto name = segment.substr(0, name_end);
     auto value = segment.substr(name_end + 1);
     if (value == null_fallback) {
       return Key{name, "", true};
     }
     return Key{name, value, false};
   ```

##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -120,6 +125,7 @@ class ARROW_DS_EXPORT KeyValuePartitioning : public Partitioning {
   /// of a scalar value
   struct Key {
     std::string name, value;
+    bool null;

Review comment:
       nit:
   ```suggestion
       std::string name;
    util::optional<std::string> value;
   ```

##########
File path: cpp/src/arrow/compute/kernels/vector_hash_test.cc
##########
@@ -305,6 +305,11 @@ TEST_F(TestHashKernel, ValueCountsBoolean) {
                    ArrayFromJSON(boolean(), "[false]"), ArrayFromJSON(int64(), "[2]"));
 }
 
+TEST_F(TestHashKernel, ValueCountsNull) {
+  CheckValueCounts<NullType, std::nullptr_t>(
+      null(), {nullptr, nullptr, nullptr}, {true, false, true}, {nullptr}, {false}, {3});

Review comment:
       For improved readability, please prefer JSON test arrays
   ```suggestion
     CheckValueCounts(ArrayFromJSON(null(), "[null, null, null]"),
                      ArrayFromJSON(null(), "[null]"), ArrayFromJSON(int64(), "[3]"));
   ```
   In particular, note that the from-vector case attempts to construct a null array with only the middle slot null

##########
File path: cpp/src/arrow/dataset/partition_test.cc
##########
@@ -322,8 +397,9 @@ TEST_F(TestPartitioning, DiscoverHiveSchema) {
 }
 

Review comment:
       Please add a test asserting that `{"/alpha=1", "/alpha=_FALLBACK_STRING_"}` is inferred as integer

##########
File path: cpp/src/arrow/dataset/partition_test.cc
##########
@@ -103,6 +137,32 @@ class TestPartitioning : public ::testing::Test {
   std::shared_ptr<Schema> written_schema_;
 };
 
+TEST_F(TestPartitioning, Partition) {
+  auto partition_schema = schema({field("a", int32()), field("b", utf8())});
+  auto schema_ = schema({field("a", int32()), field("b", utf8()), field("c", uint32())});
+  auto remaining_schema = schema({field("c", uint32())});
+  auto partitioning = std::make_shared<DirectoryPartitioning>(partition_schema);
+  std::string json = R"([{"a": 3,    "b": "x",  "c": 0},
+                         {"a": 3,    "b": "x",  "c": 1},
+                         {"a": 1,    "b": null, "c": 2},
+                         {"a": null, "b": null, "c": 3},
+                         {"a": null, "b": "z",  "c": 4},
+                         {"a": null, "b": null, "c": 5}
+                       ])";
+  std::vector<std::string> expected_batches = {R"([{"c": 0}, {"c": 1}])", R"([{"c": 2}])",
+                                               R"([{"c": 3}, {"c": 5}])",
+                                               R"([{"c": 4}])"};
+  std::vector<Expression> expected_expressions = {
+      and_(equal(field_ref("a"), literal(3)), equal(field_ref("b"), literal("x"))),
+      and_(equal(field_ref("a"), literal(1)), is_null(field_ref("b"))),
+      and_(is_null(field_ref("a")), is_null(field_ref("b"))),
+      and_(is_null(field_ref("a")), equal(field_ref("b"), literal("z")))};
+  AssertPartition(partitioning, schema_, json, remaining_schema, expected_batches,
+                  expected_expressions);
+}
+
+TEST_F(TestPartitioning, StructDictionaryNull) {}
+

Review comment:
       ```suggestion
   ```

##########
File path: cpp/src/arrow/compute/kernels/vector_hash.cc
##########
@@ -687,11 +738,14 @@ void RegisterVectorHash(FunctionRegistry* registry) {
   // ----------------------------------------------------------------------
   // dictionary_encode
 
+  const auto kDefaultDictionaryEncodeOptions = DictionaryEncodeOptions::Defaults();

Review comment:
       This needs to have static storage duration; please move it into the anonymous namespace next to `dictionary_encode_doc`

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1580,6 +1584,8 @@ cdef class HivePartitioning(Partitioning):
         corresponding entry of `dictionaries` must be an array containing
         every value which may be taken by the corresponding column or an
         error will be raised in parsing.
+    null_fallback : str

Review comment:
       ```suggestion
       null_fallback : str, default "__HIVE_DEFAULT_PARTITION__"
   ```

##########
File path: cpp/src/arrow/dataset/expression.h
##########
@@ -159,10 +162,27 @@ Expression call(std::string function, std::vector<Expression> arguments,
 ARROW_DS_EXPORT
 std::vector<FieldRef> FieldsInExpression(const Expression&);
 
+/// Represents either a concrete value or a hint that a field is valid/invalid
+struct KnownFieldValue {

Review comment:
       Instead, let's use a null scalar to indicate that a field is known to be null. In particular, this allows us to maintain division of responsibilities between known value replacement and constant folding, and minimizes special cases which need to be recognized and handled in simplified expressions. This is a significant change so I've assembled a PR summarizing it: https://github.com/westonpace/arrow/pull/5
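   A sketch of the idea (keyed by field name rather than `FieldRef` for brevity; the map layout is an assumption, not the actual structure): a field known to be null is stored as an ordinary entry whose value is a null scalar, so simplification needs no separate "null hint" case.
   ```cpp
   #include <memory>
   #include <string>
   #include <unordered_map>

   #include "arrow/datum.h"
   #include "arrow/scalar.h"
   #include "arrow/type.h"

   std::unordered_map<std::string, arrow::Datum> KnownValuesExample() {
     std::unordered_map<std::string, arrow::Datum> known;
     // "a" == 3: a concrete scalar value
     known.emplace("a", arrow::Datum(std::make_shared<arrow::Int32Scalar>(3)));
     // "b" is null: represented by a null scalar of the field's type
     known.emplace("b", arrow::Datum(arrow::MakeNullScalar(arrow::utf8())));
     return known;
   }
   ```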

##########
File path: cpp/src/arrow/dataset/expression.h
##########
@@ -135,6 +135,9 @@ inline bool operator!=(const Expression& l, const Expression& r) { return !l.Equ
 ARROW_DS_EXPORT
 Expression literal(Datum lit);
 
+ARROW_DS_EXPORT
+Expression null_literal(const std::shared_ptr<DataType>& type);

Review comment:
       Since this is only used in tests, please move it to `expression_test.cc`







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r565396190



##########
File path: cpp/src/arrow/compute/api_vector.h
##########
@@ -63,6 +63,25 @@ enum class SortOrder {
   Descending,
 };
 
+struct DictionaryEncodeOptions : public FunctionOptions {
+  /// Configure how null values will be encoded
+  enum NullEncodingBehavior {
+    /// the null value will be added to the dictionary with a proper index
+    ENCODE,
+    /// the null value will be masked in the indices array
+    MASK,
+    /// the null value will not be included in the dictionary
+    SKIP

Review comment:
       What does SKIP mean in this case? Then your resulting encoded array will be shorter as the original array? (dropping any nulls)








[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r576476421



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file():
 
 @pytest.mark.parquet
 @pytest.mark.parametrize('partitioning', ["directory", "hive"])
+@pytest.mark.parametrize('null_fallback', ['xyz', None])
 @pytest.mark.parametrize('partition_keys', [
     (["A", "B", "C"], [1, 2, 3]),
     ([1, 2, 3], ["A", "B", "C"]),
     (["A", "B", "C"], ["D", "E", "F"]),
     ([1, 2, 3], [4, 5, 6]),
+    ([1, None, 3], ["A", "B", "C"]),
+    ([1, 2, 3], ["A", None, "C"]),
+    ([None, 2, 3], [None, 2, 3]),
 ])
-def test_open_dataset_partitioned_dictionary_type(tempdir, partitioning,
-                                                  partition_keys):
+def test_open_dataset_partitioned_dictionary_type(
+    tempdir, partitioning, null_fallback, partition_keys
+):
     # ARROW-9288 / ARROW-9476
     import pyarrow.parquet as pq
-    table = pa.table({'a': range(9), 'b': [0.] * 4 + [1.] * 5})
+
+    table = pa.table({'a': range(9), 'b': [0.0] * 4 + [1.0] * 5})
+
+    if None in partition_keys[0] or None in partition_keys[1]:
+        # Directory partitioning can't handle the first part being null
+        return

Review comment:
       Good catch, this test had a few other issues as well that I've fixed.







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r575451640



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -74,15 +74,26 @@ Status KeyValuePartitioning::SetDefaultValuesFromKeys(const Expression& expr,
                                                       RecordBatchProjector* projector) {
   ARROW_ASSIGN_OR_RAISE(auto known_values, ExtractKnownFieldValues(expr));
   for (const auto& ref_value : known_values) {
-    if (!ref_value.second.is_scalar()) {
-      return Status::Invalid("non-scalar partition key ", ref_value.second.ToString());
+    const auto& known_value = ref_value.second;
+    if (known_value.concrete() && !known_value.datum.is_scalar()) {
+      return Status::Invalid("non-scalar partition key ", known_value.datum.ToString());
     }
 
     ARROW_ASSIGN_OR_RAISE(auto match,
                           ref_value.first.FindOneOrNone(*projector->schema()));
 
     if (match.empty()) continue;
-    RETURN_NOT_OK(projector->SetDefaultValue(match, ref_value.second.scalar()));
+
+    const auto& field = projector->schema()->field(match[0]);
+    if (known_value.concrete()) {
+      RETURN_NOT_OK(projector->SetDefaultValue(match, known_value.datum.scalar()));
+    } else if (known_value.valid) {
+      return Status::Invalid(
+          "Partition expression not defined enough to set default value for ",

Review comment:
       Today we only get into this state if the expression is something like `pa.field("a").is_valid()`, although technically it could be inferred from expressions like `pa.field("a") > 7` as well.  Any expression that can only evaluate to true when the field is non-null tells us the field will be non-null, but not what its value is.
   
   Although, on second review, it's probably better to just `continue` in this case rather than return an error.







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r580558289



##########
File path: cpp/src/arrow/compute/kernels/vector_hash.cc
##########
@@ -687,11 +738,14 @@ void RegisterVectorHash(FunctionRegistry* registry) {
   // ----------------------------------------------------------------------
   // dictionary_encode
 
+  const auto kDefaultDictionaryEncodeOptions = DictionaryEncodeOptions::Defaults();

Review comment:
       Done.

##########
File path: cpp/src/arrow/dataset/expression.h
##########
@@ -135,6 +135,9 @@ inline bool operator!=(const Expression& l, const Expression& r) { return !l.Equ
 ARROW_DS_EXPORT
 Expression literal(Datum lit);
 
+ARROW_DS_EXPORT
+Expression null_literal(const std::shared_ptr<DataType>& type);

Review comment:
       Done.







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r568266959



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -625,9 +621,17 @@ class StructDictionary {
 
  private:
   Status AddOne(Datum column, std::shared_ptr<Int32Array>* fused_indices) {
-    if (column.type()->id() != Type::DICTIONARY) {
-      ARROW_ASSIGN_OR_RAISE(column, compute::DictionaryEncode(std::move(column)));
+    if (column.type()->id() == Type::DICTIONARY) {
+      // compute::DictionaryEncode doesn't support dictionary and, even if it did, it
+      // would be a null op and return a flat dictionary.  In order to group by dictionary
+      // we would need to be able to create a nested dictionary.
+      return Status::NotImplemented(
+          "Cannot use column of type dictionary as grouping criteria");

Review comment:
       Yes, I keep thinking of dictionary arrays as arrays of dictionaries `[{"a": 1, "b": 2}, ...]` and not simply a different encoding.  My brain was off on a tangent when I wrote this code.
   
   I added the logic to encode nulls by decoding (casting) the dictionary first and then re-encoding.  I also added a test case for this.
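   A hedged sketch of the decode-then-re-encode idea for dictionary columns containing nulls (a simplification of what is described above, not the PR's exact code):
   ```cpp
   #include "arrow/api.h"
   #include "arrow/compute/api.h"
   #include "arrow/util/checked_cast.h"

   // Flatten a dictionary column so its nulls become real values again, then
   // re-encode with ENCODE so null ends up with its own dictionary slot.
   arrow::Result<arrow::Datum> ReEncodeWithNullSlot(const arrow::Datum& column) {
     const auto& dict_type =
         arrow::internal::checked_cast<const arrow::DictionaryType&>(*column.type());
     ARROW_ASSIGN_OR_RAISE(arrow::Datum decoded,
                           arrow::compute::Cast(column, dict_type.value_type()));
     auto options = arrow::compute::DictionaryEncodeOptions::Defaults();
     options.null_encoding_behavior = arrow::compute::DictionaryEncodeOptions::ENCODE;
     return arrow::compute::DictionaryEncode(decoded, options);
   }
   ```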







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r575453002



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file():
 
 @pytest.mark.parquet
 @pytest.mark.parametrize('partitioning', ["directory", "hive"])
+@pytest.mark.parametrize('null_fallback', ['xyz', None])

Review comment:
       In that case the test does not pass anything in when it creates the partitioning, and it verifies that the correct default value is used.







[GitHub] [arrow] bkietz commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r570419335



##########
File path: cpp/src/arrow/compute/kernels/vector_hash.cc
##########
@@ -76,6 +79,8 @@ class UniqueAction final : public ActionBase {
   template <class Index>
   void ObserveNotFound(Index index) {}
 
+  bool ShouldEncodeNulls() { return true; }

Review comment:
       Making this explicitly constexpr (where possible) will help the compiler drop dead code 
   ```suggestion
     constexpr bool ShouldEncodeNulls() const { return true; }
   ```
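
   A brief illustration of why that helps (a hypothetical call site; the function name `ObserveNull` and its placement are assumptions):
   ```cpp
   // Because ShouldEncodeNulls() is constexpr and returns a literal per Action
   // type, the compiler can fold this branch after inlining and drop the dead
   // path entirely.
   template <typename Action>
   void ObserveNull(Action* action) {
     if (action->ShouldEncodeNulls()) {
       // append the null to the memo table so it gets a real dictionary index
     } else {
       // emit a masked (null) output index and keep the dictionary null-free
     }
   }
   ```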







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r568264655



##########
File path: cpp/src/arrow/pretty_print.cc
##########
@@ -670,4 +670,49 @@ Status PrettyPrint(const Schema& schema, const PrettyPrintOptions& options,
   return Status::OK();
 }
 
+void GdbPrintArray(const Array& arr, int indent) {

Review comment:
       Removed.

##########
File path: cpp/src/arrow/compute/api_vector.h
##########
@@ -63,6 +63,25 @@ enum class SortOrder {
   Descending,
 };
 
+struct DictionaryEncodeOptions : public FunctionOptions {
+  /// Configure how null values will be encoded
+  enum NullEncodingBehavior {
+    /// the null value will be added to the dictionary with a proper index
+    ENCODE,
+    /// the null value will be masked in the indices array
+    MASK,
+    /// the null value will not be included in the dictionary
+    SKIP

Review comment:
       I removed SKIP and added some comments similar to those in filter.
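
   A minimal usage sketch of the resulting options (this assumes a `Defaults()` helper and a `null_encoding_behavior` member, neither of which is shown in the excerpt above, and `array` stands for any Datum containing nulls):
   ```cpp
   // ENCODE: the null becomes a real dictionary entry with its own index.
   arrow::compute::DictionaryEncodeOptions encode_nulls;
   encode_nulls.null_encoding_behavior =
       arrow::compute::DictionaryEncodeOptions::ENCODE;
   auto encoded = arrow::compute::DictionaryEncode(array, encode_nulls);

   // MASK (the default): nulls stay in the output indices and the dictionary
   // holds only non-null values.
   auto masked = arrow::compute::DictionaryEncode(
       array, arrow::compute::DictionaryEncodeOptions::Defaults());
   ```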

##########
File path: cpp/src/arrow/compute/api_vector.h
##########
@@ -296,7 +315,8 @@ Result<std::shared_ptr<StructArray>> ValueCounts(const Datum& value,
 /// \since 1.0.0
 /// \note API not yet finalized
 ARROW_EXPORT
-Result<Datum> DictionaryEncode(const Datum& data, ExecContext* ctx = NULLPTR);
+Result<Datum> DictionaryEncode(const Datum& data, const DictionaryEncodeOptions& options,

Review comment:
       Done, and cleaned up all the other call sites that were explicitly stating the default options.

##########
File path: cpp/src/arrow/compute/kernels/vector_hash.cc
##########
@@ -451,8 +487,12 @@ struct HashKernelTraits<Type, Action, enable_if_has_string_view<Type>> {
 template <typename Type, typename Action>
 std::unique_ptr<HashKernel> HashInitImpl(KernelContext* ctx, const KernelInitArgs& args) {
   using HashKernelType = typename HashKernelTraits<Type, Action>::HashKernel;
-  auto result = ::arrow::internal::make_unique<HashKernelType>(args.inputs[0].type,
-                                                               ctx->memory_pool());
+  DictionaryEncodeOptions options;
+  if (auto options_ptr = static_cast<const DictionaryEncodeOptions*>(args.options)) {

Review comment:
       I moved the options decoding into Action, although the `HashKernel` still needs to be aware of it (to decide whether `null` should go in the dictionary or not), so the kernel->action interface changed slightly (added `ShouldEncodeNulls`).
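
   Roughly the shape being described, as a sketch (the struct and member names here are assumptions, not the PR's exact code):
   ```cpp
   struct DictEncodeAction {
     explicit DictEncodeAction(const arrow::compute::FunctionOptions* options) {
       auto dict_options =
           static_cast<const arrow::compute::DictionaryEncodeOptions*>(options);
       encode_nulls_ = dict_options != nullptr &&
                       dict_options->null_encoding_behavior ==
                           arrow::compute::DictionaryEncodeOptions::ENCODE;
     }
     // The kernel consults this to decide whether a null gets its own
     // dictionary entry or is masked in the output indices.
     bool ShouldEncodeNulls() const { return encode_nulls_; }
     bool encode_nulls_;
   };
   ```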

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -625,9 +621,17 @@ class StructDictionary {
 
  private:
   Status AddOne(Datum column, std::shared_ptr<Int32Array>* fused_indices) {
-    if (column.type()->id() != Type::DICTIONARY) {
-      ARROW_ASSIGN_OR_RAISE(column, compute::DictionaryEncode(std::move(column)));
+    if (column.type()->id() == Type::DICTIONARY) {
+      // compute::DictionaryEncode doesn't support dictionary and, even if it did, it
+      // would be a null op and return a flat dictionary.  In order to group by dictionary
+      // we would need to be able to create a nested dictionary.
+      return Status::NotImplemented(
+          "Cannot use column of type dictionary as grouping criteria");

Review comment:
       Yes, I keep thinking of dictionary arrays as arrays of dictionaries [{"a": 1, "b": 2}, ...] and not simply a different encoding.  My brain was off on a tangent when I wrote this comment.
   
   I added the logic to encode nulls by decoding (casting) the dictionary first and then re-encoding.  I also added a test case for this.







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r580559670



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -410,21 +434,26 @@ std::shared_ptr<PartitioningFactory> DirectoryPartitioning::MakeFactory(
 }
 
 util::optional<KeyValuePartitioning::Key> HivePartitioning::ParseKey(
-    const std::string& segment) {
+    const std::string& segment, const std::string& null_fallback) {
   auto name_end = string_view(segment).find_first_of('=');
+  // Not round-trippable
   if (name_end == string_view::npos) {
     return util::nullopt;
   }
 
-  return Key{segment.substr(0, name_end), segment.substr(name_end + 1)};
+  auto value = segment.substr(name_end + 1);
+  if (value == null_fallback) {
+    return Key{segment.substr(0, name_end), "", true};
+  }
+  return Key{segment.substr(0, name_end), segment.substr(name_end + 1), false};

Review comment:
       Made the change (not through git since there were other changes here)
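
   To illustrate the parsing behavior (the fallback string value and calling ParseKey as shown are assumptions used only for this example):
   ```cpp
   auto null_key = HivePartitioning::ParseKey(
       "x=__HIVE_DEFAULT_PARTITION__", /*null_fallback=*/"__HIVE_DEFAULT_PARTITION__");
   // -> Key{"x", "", /*null=*/true}

   auto plain_key = HivePartitioning::ParseKey(
       "x=3", /*null_fallback=*/"__HIVE_DEFAULT_PARTITION__");
   // -> Key{"x", "3", /*null=*/false}

   auto not_a_key = HivePartitioning::ParseKey(
       "no-equals-sign", /*null_fallback=*/"__HIVE_DEFAULT_PARTITION__");
   // -> util::nullopt, since the segment is not round-trippable
   ```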







[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r580559383



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -625,8 +654,27 @@ class StructDictionary {
 
  private:
   Status AddOne(Datum column, std::shared_ptr<Int32Array>* fused_indices) {
+    if (column.type()->id() == Type::DICTIONARY) {
+      if (column.null_count() != 0) {
+        // TODO Optimize this by allowing DictionaryEncode to transfer a null-masked

Review comment:
       ARROW-11732



