You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "agoncharuk (via GitHub)" <gi...@apache.org> on 2023/05/31 13:14:37 UTC

[GitHub] [arrow] agoncharuk opened a new issue, #35843: [C++] Acero produces unexpected output type for add function on decimals

agoncharuk opened a new issue, #35843:
URL: https://github.com/apache/arrow/issues/35843

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   According to compute functions documentation, the `add` function result precision and scale are computed as
   ```
   scale = max(s1, s2)
   precision = max(p1-s1, p2-s2) + 1 + scale
   ```
   For some reason, this inference produces an incorrect result for `decimal(10, 1)` and `decimal(10, 2)` types when using acero streaming engine, namely:
   ```
   add(decimal(10, 1), decimal(10, 2)) -> decimal(11, 1); // Unexpected
   add(decimal(10, 2), decimal(10, 1)) -> decimal(12, 2); // Ok
   ```
   
   Full test:
   ```
       auto t0 = std::static_pointer_cast<arrow::DecimalType>(*arrow::DecimalType::Make(arrow::Type::DECIMAL128, 10, 1));
       auto t1 = std::static_pointer_cast<arrow::DecimalType>(*arrow::DecimalType::Make(arrow::Type::DECIMAL128, 10, 2));
   
       arrow::Decimal128Builder b0(t0);
       arrow::Decimal128Builder b1(t1);
   
       for (int i = 0; i < 10; ++i) {
           EXPECT_TRUE(b0.Append(i * 10).ok());
           EXPECT_TRUE(b1.Append((i + 1) * 100).ok());
       }
   
       auto batch = arrow::compute::ExecBatch::Make({*b0.Finish(), *b1.Finish()}, 10).ValueOrDie();
       auto gen = arrow::MakeVectorGenerator(std::vector{std::make_optional(std::move(batch))});
       auto schema = arrow::schema({arrow::field("c0", t0), arrow::field("c1", t1)});
   
       arrow::acero::Declaration source{"source", arrow::acero::SourceNodeOptions{schema, std::move(gen)}};
   
       arrow::compute::Expression p0 = cp::call("add", {arrow::compute::field_ref("c0"), arrow::compute::field_ref("c1")});
       arrow::compute::Expression p1 = cp::call("add", {arrow::compute::field_ref("c1"), arrow::compute::field_ref("c0")});
   
       arrow::acero::Declaration project{
         "project", {std::move(source)}, arrow::acero::ProjectNodeOptions({p0, p1})};
   
       auto responseTable = arrow::acero::DeclarationToTable(std::move(project)).ValueOrDie();
       auto outSchema = responseTable->schema();
   
       auto scale = std::max(t0->scale(), t1->scale());
       auto precision = std::max(t0->precision() - t0->scale(), t1->precision() - t1->scale()) + 1 + scale;
       auto expectedOutType = *arrow::DecimalType::Make(arrow::Type::DECIMAL128, precision, scale);
   
       for (int c = 0; c < outSchema->num_fields(); ++c) {
           EXPECT_TRUE(expectedOutType->Equals(outSchema->field(c)->type())) << "Failed for " << c << ": " << outSchema->field(c)->type()->ToString();
       }
   ```
   
   If I call `arrow::compute::Add()` directly on `Datum`s, the expected types are correct regardless of the order.
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #35843: [C++] Acero produces unexpected output type for add function on decimals

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35843:
URL: https://github.com/apache/arrow/issues/35843#issuecomment-1583575040

   Also, as an idle unrelated note, I've noticed that casting decimals seems to occupy a fair amount of our time in benchmarks.  It's possible this is correct (e.g. maybe this is just a RAM cost being paid for our first traversal through the actual data) but I wonder if combining the cast and the math operation in this case is something that would be possible and save time?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #35843: [C++] Acero produces unexpected output type for add function on decimals

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35843:
URL: https://github.com/apache/arrow/issues/35843#issuecomment-1583567633

   Good catch.
   
   This promotion is done in two parts, a cast, and the function itself.  For your example we should first be implicitly casting both inputs to `decimal(10, 2)` and then we should apply the function (which is where the precision gets increased).  So we have:
   
   ```
   decimal(10, 1) --> CAST decimal(10, 2)----ADD---decimal(12, 2)
   decimal(10, 2) -------> decimal(10, 2) --/
   ```
   
   When calling functions outside Acero (e.g. in pyarrow this would be what happens on `pyarrow.compute.add`) we call a method named `DispatchBest` on the function object to choose the most appropriate kernel with each invocation.  The default behavior is to call `DispatchExact` and fail otherwise.  However, some functions (e.g. most arithmetic functions) will override `DispatchBest` and attempt to do some implicit casting.  This is where the cast is inserted.
   
   Inside Acero, we don't want to have to go through the logic to pick the correct kernel again and again on every single batch.  So, instead, we "bind the expression" to the input schema.  Part of this binding process is picking the correct function kernel to execute.  However, it seems the logic here is a little different:
   
   ```
     // First try and bind exactly
     Result<const Kernel*> maybe_exact_match = call.function->DispatchExact(types);
     if (maybe_exact_match.ok()) {
       call.kernel = *maybe_exact_match;
     } else {
       if (!insert_implicit_casts) {
         return maybe_exact_match.status();
       }
       // If exact binding fails, and we are allowed to cast, then prefer casting literals
       // first.  Since DispatchBest generally prefers up-casting the best way to do this is
       // first down-cast the literals as much as possible
       types = GetTypesWithSmallestLiteralRepresentation(call.arguments);
       ARROW_ASSIGN_OR_RAISE(call.kernel, call.function->DispatchBest(&types));
   ```
   
   In other words, we start with `DispatchExact` and only call `DispatchBest` if we can't find an exact match.
   
   The problem you are encountering is that `DispatchExact` is recording an exact match.  To understand why that happens we have to look at how we decide if kernels are a match.  To do that we register an input matcher for each of the arguments.  The input matcher might be "This only matches if the type is exactly X" but that doesn't work so well for parameterized types (we'd have to register a kernel for every single instance of decimal128, etc.)
   
   So, for parameterized types, we typically register a "type matcher" of some kind.  It appears that the type matcher for decimal arithmetic is "this input matches if the type is a decimal type, regardless of precision or scale".  So that's how we end up thinking this is an exact match.
   
   A workaround for now would be to do the upcasting explicitly, yourself, but that certainly isn't convenient.
   
   Probably, what would be ideal, is if the type matcher was "match if this is a decimal type and the other type is exactly the same" but the type matcher interface does not currently see all of the types (just the one type it is considering).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org