You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/19 00:49:11 UTC

[GitHub] [arrow] iChauster opened a new pull request, #13644: [ARROW-17117]: [C++] Add timestamp column datatype support for AsOfJoin

iChauster opened a new pull request, #13644:
URL: https://github.com/apache/arrow/pull/13644

   Previously, AsOfJoin only supported `int64` as it's `on` column. Here, we add support for `arrow::timestamp(TimeUnit::NANO, "UTC")`.
   
   If there is a more generalized way of dealing with particular timestamps, please let me know!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
icexelloss commented on PR #13644:
URL: https://github.com/apache/arrow/pull/13644#issuecomment-1189322585

   @iChauster You are missing test for this change. Please add test in asof_join_test file. I think it's ok to only allow "timestamp" type instead of supporting both "long" and "timestamp"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] iChauster commented on a diff in pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
iChauster commented on code in PR #13644:
URL: https://github.com/apache/arrow/pull/13644#discussion_r924763702


##########
cpp/src/arrow/compute/exec/asof_join_node.cc:
##########
@@ -630,24 +650,28 @@ class AsofJoinNode : public ExecNode {
       for (int i = 0; i < input_schema->num_fields(); ++i) {
         const auto field = input_schema->field(i);
         if (field->name() == on_field_name) {
-          if (kSupportedOnTypes_.find(field->type()) == kSupportedOnTypes_.end()) {
-            return Status::Invalid("Unsupported type for on key: ", field->name());
+          // Equals must be used when checking for timestamp types.
+          if (!field->type()->Equals(TimestampType(TimeUnit::NANO, "UTC")) &&

Review Comment:
   I tried this initially, but I don't think `.find(field->type())` works as intended with the timestamp fields. Per my comment, I added an additional check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
icexelloss commented on PR #13644:
URL: https://github.com/apache/arrow/pull/13644#issuecomment-1205713099

   Let's close this PR. Will pick up and do a new one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
icexelloss commented on code in PR #13644:
URL: https://github.com/apache/arrow/pull/13644#discussion_r924719592


##########
cpp/src/arrow/compute/exec/asof_join_node.cc:
##########
@@ -630,24 +650,28 @@ class AsofJoinNode : public ExecNode {
       for (int i = 0; i < input_schema->num_fields(); ++i) {
         const auto field = input_schema->field(i);
         if (field->name() == on_field_name) {
-          if (kSupportedOnTypes_.find(field->type()) == kSupportedOnTypes_.end()) {
-            return Status::Invalid("Unsupported type for on key: ", field->name());
+          // Equals must be used when checking for timestamp types.
+          if (!field->type()->Equals(TimestampType(TimeUnit::NANO, "UTC")) &&
+              kSupportedOnTypes_.find(field->type()) == kSupportedOnTypes_.end()) {
+            return Status::Invalid("Unsupported type for on key: ",
+                                   field->type()->name());
           }
           // Only add on field from the left table
           if (j == 0) {
             fields.push_back(field);
           }
         } else if (field->name() == by_field_name) {
           if (kSupportedByTypes_.find(field->type()) == kSupportedByTypes_.end()) {
-            return Status::Invalid("Unsupported type for by key: ", field->name());
+            return Status::Invalid("Unsupported type for by key: ",
+                                   field->type()->name());

Review Comment:
   Good catch. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] iChauster commented on pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
iChauster commented on PR #13644:
URL: https://github.com/apache/arrow/pull/13644#issuecomment-1191737981

   @icexelloss I adjusted DoRunBasicTest to run with the same data, but with the timestamp as the time column data type in the schema. Is there anything additional that might interesting to test?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
icexelloss commented on code in PR #13644:
URL: https://github.com/apache/arrow/pull/13644#discussion_r924718855


##########
cpp/src/arrow/compute/exec/asof_join_node.cc:
##########
@@ -630,24 +650,28 @@ class AsofJoinNode : public ExecNode {
       for (int i = 0; i < input_schema->num_fields(); ++i) {
         const auto field = input_schema->field(i);
         if (field->name() == on_field_name) {
-          if (kSupportedOnTypes_.find(field->type()) == kSupportedOnTypes_.end()) {
-            return Status::Invalid("Unsupported type for on key: ", field->name());
+          // Equals must be used when checking for timestamp types.
+          if (!field->type()->Equals(TimestampType(TimeUnit::NANO, "UTC")) &&

Review Comment:
   Can you put TimestampType in kSupportedOnTypes_



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] iChauster commented on a diff in pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
iChauster commented on code in PR #13644:
URL: https://github.com/apache/arrow/pull/13644#discussion_r924766062


##########
cpp/src/arrow/compute/exec/asof_join_node.cc:
##########
@@ -630,24 +650,28 @@ class AsofJoinNode : public ExecNode {
       for (int i = 0; i < input_schema->num_fields(); ++i) {
         const auto field = input_schema->field(i);
         if (field->name() == on_field_name) {
-          if (kSupportedOnTypes_.find(field->type()) == kSupportedOnTypes_.end()) {
-            return Status::Invalid("Unsupported type for on key: ", field->name());
+          // Equals must be used when checking for timestamp types.
+          if (!field->type()->Equals(TimestampType(TimeUnit::NANO, "UTC")) &&

Review Comment:
   e.g, even though I have added `TimestampType` to `kSupportedOnTypes_`, it has no effect. I looked for a more generic way to detect a `TimestampType`, like `arrow::istimestamp_type` in `src/arrow/type_traits.h` but I'm not sure how it works with the templating.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] iChauster commented on pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
iChauster commented on PR #13644:
URL: https://github.com/apache/arrow/pull/13644#issuecomment-1189353308

   @icexelloss , yeah I was hoping it would be easier. I think we would really only need a way to tell if a data type is a timestamp at runtime, and we should be able to supply this information to `TimestampBuilder`.
   
   Worst case scenario, we can enumerate over all possible timestamps (one for each time unit and timezone we'd like to support), but I don't think this is a good solution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13644:
URL: https://github.com/apache/arrow/pull/13644#issuecomment-1188593663

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] iChauster commented on a diff in pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
iChauster commented on code in PR #13644:
URL: https://github.com/apache/arrow/pull/13644#discussion_r924771462


##########
cpp/src/arrow/compute/exec/asof_join_node.cc:
##########
@@ -460,10 +464,9 @@ class CompositeReferenceTable {
   }
 
   template <class Builder, class PrimitiveType>
-  Result<std::shared_ptr<Array>> MaterializePrimitiveColumn(MemoryPool* memory_pool,
+  Result<std::shared_ptr<Array>> MaterializePrimitiveColumn(Builder& builder,

Review Comment:
   Was fighting with the compiler for a bit on this: 
   `TimestampBuilder` does not seem to be able to initialized with just a `memory_pool`, I think it requires a `arrow::DataType` object as well. That is the only difference that requires this function to be refactored and split into `MaterializePrimitveColumn` and `MaterializeTimestampPrimitiveColumn`.
   Everything downstream of the builder initialization is the same -- let me know if there is a better way to handle this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13644: [ARROW-17117]: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13644:
URL: https://github.com/apache/arrow/pull/13644#issuecomment-1188477891

   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
   
   Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] iChauster commented on a diff in pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
iChauster commented on code in PR #13644:
URL: https://github.com/apache/arrow/pull/13644#discussion_r924766062


##########
cpp/src/arrow/compute/exec/asof_join_node.cc:
##########
@@ -630,24 +650,28 @@ class AsofJoinNode : public ExecNode {
       for (int i = 0; i < input_schema->num_fields(); ++i) {
         const auto field = input_schema->field(i);
         if (field->name() == on_field_name) {
-          if (kSupportedOnTypes_.find(field->type()) == kSupportedOnTypes_.end()) {
-            return Status::Invalid("Unsupported type for on key: ", field->name());
+          // Equals must be used when checking for timestamp types.
+          if (!field->type()->Equals(TimestampType(TimeUnit::NANO, "UTC")) &&

Review Comment:
   e.g, even though I have added `TimestampType` to `kSupportedOnTypes_`, it has no effect, and is not found in the array. I looked for a more generic way to detect a `TimestampType`, like `arrow::istimestamp_type` in `src/arrow/type_traits.h` but I'm not sure how it works with the templating.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] iChauster commented on a diff in pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
iChauster commented on code in PR #13644:
URL: https://github.com/apache/arrow/pull/13644#discussion_r924771462


##########
cpp/src/arrow/compute/exec/asof_join_node.cc:
##########
@@ -460,10 +464,9 @@ class CompositeReferenceTable {
   }
 
   template <class Builder, class PrimitiveType>
-  Result<std::shared_ptr<Array>> MaterializePrimitiveColumn(MemoryPool* memory_pool,
+  Result<std::shared_ptr<Array>> MaterializePrimitiveColumn(Builder& builder,

Review Comment:
   Was fighting with the compiler for a bit on this: 
   `TimestampBuilder` does not seem to be able to initialized with just a `memory_pool`, I think it requires a `arrow::DataType` object as well. That is the only difference that requires this function to be refactored and split into `MaterializePrimitveColumn` and `MaterializeTimestampPrimitiveColumn`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13644:
URL: https://github.com/apache/arrow/pull/13644#issuecomment-1188593651

   https://issues.apache.org/jira/browse/ARROW-17117


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
icexelloss commented on PR #13644:
URL: https://github.com/apache/arrow/pull/13644#issuecomment-1189303717

   How hard it is to support all timestamp with other timezone too? I think the raw values under the hood is the same regardless the timezone so I'd imagine it shouldn't hard?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
rok commented on PR #13644:
URL: https://github.com/apache/arrow/pull/13644#issuecomment-1206217862

   > Do we want to do the enumeration, or is supporting this one timestamp (Nanosecond, UTC) okay for now?
   
   It would probably be best to support temporal, integer and perhaps(?) float types.
   
   > Let's close this PR since this is the last week of Ivan's internship. Will pick this up and open a new one.
   
   Nice work @iChauster!
   Feel free to ping me on the new PR @icexelloss.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss closed pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
icexelloss closed pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin
URL: https://github.com/apache/arrow/pull/13644


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
icexelloss commented on code in PR #13644:
URL: https://github.com/apache/arrow/pull/13644#discussion_r924732755


##########
cpp/src/arrow/compute/exec/asof_join_node.cc:
##########
@@ -478,6 +481,23 @@ class CompositeReferenceTable {
     ARROW_RETURN_NOT_OK(builder.Finish(&result));
     return result;
   }
+
+  template <class Builder, class PrimitiveType>
+  Result<std::shared_ptr<Array>> MaterializePrimitiveColumn(MemoryPool* memory_pool,
+                                                            size_t i_table,
+                                                            col_index_t i_col) {
+    Builder builder(memory_pool);
+    return MaterializePrimitiveColumn<Builder, PrimitiveType>(builder, i_table, i_col);
+  }
+
+  template <class PrimitiveType>
+  Result<std::shared_ptr<Array>> MaterializeTimestampPrimitiveColumn(
+      MemoryPool* memory_pool, size_t i_table, col_index_t i_col) {
+    TimestampBuilder builder =
+        TimestampBuilder(timestamp(TimeUnit::NANO, "UTC"), memory_pool);

Review Comment:
   Can you not hardcode the timezone here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
icexelloss commented on PR #13644:
URL: https://github.com/apache/arrow/pull/13644#issuecomment-1189459886

   Ok supporting UTC timestamp is reasonable (and improvement over support long type)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
icexelloss commented on code in PR #13644:
URL: https://github.com/apache/arrow/pull/13644#discussion_r924717572


##########
cpp/src/arrow/compute/exec/asof_join_node.cc:
##########
@@ -460,10 +464,9 @@ class CompositeReferenceTable {
   }
 
   template <class Builder, class PrimitiveType>
-  Result<std::shared_ptr<Array>> MaterializePrimitiveColumn(MemoryPool* memory_pool,
+  Result<std::shared_ptr<Array>> MaterializePrimitiveColumn(Builder& builder,

Review Comment:
   What is this change for?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] iChauster commented on pull request #13644: ARROW-17117: [C++] Add timestamp column datatype support for AsOfJoin

Posted by GitBox <gi...@apache.org>.
iChauster commented on PR #13644:
URL: https://github.com/apache/arrow/pull/13644#issuecomment-1191617385

   Do we want to do the enumeration, or is supporting this one timestamp (Nanosecond, UTC) okay for now?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org