You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/22 16:32:32 UTC

[GitHub] [arrow] pitrou opened a new pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

pitrou opened a new pull request #10782:
URL: https://github.com/apache/arrow/pull/10782


   * Allow reading CSV columns as time32 and time64
   * Automatically infer "hh:mm" and "hh:mm:ss" as time32[s]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-887301435


   @westonpace I'd rather wait before someone requests it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-886689802


   (rebased)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-887016039


   This could be a future JIRA or "wait until it's asked for" but technically ISO-8601 allows the omission of the colons if you add a leading T but I've never seen it in practice...
   
   ```
   Thh
   Thhmm
   Thhmmss
   Thhmmss.sss
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-885051096


   https://issues.apache.org/jira/browse/ARROW-11243


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-885081199


   Benchmark runs are scheduled for baseline = 169b249057cfe7dd2462efd43f6d1de7e23bb60e and contender = 25b5c1e445f04b53a0beb40afabb3e8012c31f50. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/bd5b0a0fb1b846aca7025545d099f860...960f1aae0f734b17a2377d7a17bdb144/)
   [Finished :arrow_down:0.0% :arrow_up:1.53%] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/13fb9a396ecf4396b077794d29ed8a1c...f4c7a34e715646018a7084849de287a2/)
   [Scheduled] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/f7459221ae474079b9926f692c70a5d6...e1b16111fec44c73bb4615987ae457b0/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#discussion_r676588138



##########
File path: cpp/src/arrow/csv/column_builder_test.cc
##########
@@ -400,6 +400,24 @@ TEST_F(InferringColumnBuilderTest, MultipleChunkDate) {
                  ArrayFromJSON(date32(), "[null]")});
 }
 
+TEST_F(InferringColumnBuilderTest, SingleChunkTime) {
+  auto options = ConvertOptions::Defaults();
+  auto tg = TaskGroup::MakeSerial();
+
+  CheckInferred(tg, {{"", "01:23:45", "NA"}}, options,

Review comment:
       I don't think so. Integers can not be parsed as a time field currently.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou edited a comment on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
pitrou edited a comment on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-887301435


   @westonpace I'd rather wait for someone to request it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#discussion_r676588696



##########
File path: cpp/src/arrow/csv/column_builder_test.cc
##########
@@ -400,6 +400,24 @@ TEST_F(InferringColumnBuilderTest, MultipleChunkDate) {
                  ArrayFromJSON(date32(), "[null]")});
 }
 
+TEST_F(InferringColumnBuilderTest, SingleChunkTime) {
+  auto options = ConvertOptions::Defaults();
+  auto tg = TaskGroup::MakeSerial();
+
+  CheckInferred(tg, {{"", "01:23:45", "NA"}}, options,
+                {ArrayFromJSON(time32(TimeUnit::SECOND), "[null, 5025, null]")});
+}
+
+TEST_F(InferringColumnBuilderTest, MultipleChunkTime) {
+  auto options = ConvertOptions::Defaults();
+  auto tg = TaskGroup::MakeSerial();
+  auto type = time32(TimeUnit::SECOND);

Review comment:
       Currently, there is no inference to time64. Should there be one (for times with nanosecond precision perhaps)?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-885081199


   Benchmark runs are scheduled for baseline = 169b249057cfe7dd2462efd43f6d1de7e23bb60e and contender = 25b5c1e445f04b53a0beb40afabb3e8012c31f50. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/bd5b0a0fb1b846aca7025545d099f860...960f1aae0f734b17a2377d7a17bdb144/)
   [Scheduled] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/13fb9a396ecf4396b077794d29ed8a1c...f4c7a34e715646018a7084849de287a2/)
   [Scheduled] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/f7459221ae474079b9926f692c70a5d6...e1b16111fec44c73bb4615987ae457b0/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-885081199


   Benchmark runs are scheduled for baseline = 169b249057cfe7dd2462efd43f6d1de7e23bb60e and contender = 25b5c1e445f04b53a0beb40afabb3e8012c31f50. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/bd5b0a0fb1b846aca7025545d099f860...960f1aae0f734b17a2377d7a17bdb144/)
   [Scheduled] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/13fb9a396ecf4396b077794d29ed8a1c...f4c7a34e715646018a7084849de287a2/)
   [Scheduled] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/f7459221ae474079b9926f692c70a5d6...e1b16111fec44c73bb4615987ae457b0/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#discussion_r676585650



##########
File path: cpp/src/arrow/csv/column_builder_test.cc
##########
@@ -400,6 +400,24 @@ TEST_F(InferringColumnBuilderTest, MultipleChunkDate) {
                  ArrayFromJSON(date32(), "[null]")});
 }
 
+TEST_F(InferringColumnBuilderTest, SingleChunkTime) {
+  auto options = ConvertOptions::Defaults();
+  auto tg = TaskGroup::MakeSerial();
+
+  CheckInferred(tg, {{"", "01:23:45", "NA"}}, options,
+                {ArrayFromJSON(time32(TimeUnit::SECOND), "[null, 5025, null]")});
+}
+
+TEST_F(InferringColumnBuilderTest, MultipleChunkTime) {
+  auto options = ConvertOptions::Defaults();
+  auto tg = TaskGroup::MakeSerial();
+  auto type = time32(TimeUnit::SECOND);

Review comment:
       Should there also be an inferring test that results in time64?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#discussion_r676596917



##########
File path: cpp/src/arrow/csv/column_builder_test.cc
##########
@@ -400,6 +400,24 @@ TEST_F(InferringColumnBuilderTest, MultipleChunkDate) {
                  ArrayFromJSON(date32(), "[null]")});
 }
 
+TEST_F(InferringColumnBuilderTest, SingleChunkTime) {
+  auto options = ConvertOptions::Defaults();
+  auto tg = TaskGroup::MakeSerial();
+
+  CheckInferred(tg, {{"", "01:23:45", "NA"}}, options,
+                {ArrayFromJSON(time32(TimeUnit::SECOND), "[null, 5025, null]")});
+}
+
+TEST_F(InferringColumnBuilderTest, MultipleChunkTime) {
+  auto options = ConvertOptions::Defaults();
+  auto tg = TaskGroup::MakeSerial();
+  auto type = time32(TimeUnit::SECOND);

Review comment:
       Ah I see, I was looking at the wrong case statement, you're right. Maybe there should be one to catch any times with fractional seconds, though I don't feel strongly about it since you can explicitly declare the type/schema you want. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#discussion_r675146941



##########
File path: cpp/src/arrow/csv/column_builder_test.cc
##########
@@ -400,6 +400,24 @@ TEST_F(InferringColumnBuilderTest, MultipleChunkDate) {
                  ArrayFromJSON(date32(), "[null]")});
 }
 
+TEST_F(InferringColumnBuilderTest, SingleChunkTime) {
+  auto options = ConvertOptions::Defaults();
+  auto tg = TaskGroup::MakeSerial();
+
+  CheckInferred(tg, {{"", "01:23:45", "NA"}}, options,

Review comment:
       Could you include an integer or date (which should still allow the column to be inferred as time)?
   ```suggestion
     CheckInferred(tg, {{"", "99", "01:23:45", "NA"}}, options,
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-885080829


   @ursabot please benchmark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#discussion_r677211270



##########
File path: cpp/src/arrow/csv/column_builder_test.cc
##########
@@ -400,6 +400,24 @@ TEST_F(InferringColumnBuilderTest, MultipleChunkDate) {
                  ArrayFromJSON(date32(), "[null]")});
 }
 
+TEST_F(InferringColumnBuilderTest, SingleChunkTime) {
+  auto options = ConvertOptions::Defaults();
+  auto tg = TaskGroup::MakeSerial();
+
+  CheckInferred(tg, {{"", "01:23:45", "NA"}}, options,

Review comment:
       Will do.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#discussion_r676736686



##########
File path: cpp/src/arrow/csv/column_builder_test.cc
##########
@@ -400,6 +400,24 @@ TEST_F(InferringColumnBuilderTest, MultipleChunkDate) {
                  ArrayFromJSON(date32(), "[null]")});
 }
 
+TEST_F(InferringColumnBuilderTest, SingleChunkTime) {
+  auto options = ConvertOptions::Defaults();
+  auto tg = TaskGroup::MakeSerial();
+
+  CheckInferred(tg, {{"", "01:23:45", "NA"}}, options,

Review comment:
       I mean: I'm guessing that in the case of `"", "99", "01:23:45"` we'd fall back to string since we have an integer (definitely not a time, timestamp, ...) and a time (definitely not an integer, float, ...). That seems worth testing




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10782:
URL: https://github.com/apache/arrow/pull/10782


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-885081199


   Benchmark runs are scheduled for baseline = 169b249057cfe7dd2462efd43f6d1de7e23bb60e and contender = 25b5c1e445f04b53a0beb40afabb3e8012c31f50. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/bd5b0a0fb1b846aca7025545d099f860...960f1aae0f734b17a2377d7a17bdb144/)
   [Finished :arrow_down:0.0% :arrow_up:1.53%] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/13fb9a396ecf4396b077794d29ed8a1c...f4c7a34e715646018a7084849de287a2/)
   [Finished :arrow_down:0.91% :arrow_up:0.1%] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/f7459221ae474079b9926f692c70a5d6...e1b16111fec44c73bb4615987ae457b0/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org