You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/22 16:32:32 UTC
[GitHub] [arrow] pitrou opened a new pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
pitrou opened a new pull request #10782:
URL: https://github.com/apache/arrow/pull/10782
* Allow reading CSV columns as time32 and time64
* Automatically infer "hh:mm" and "hh:mm:ss" as time32[s]
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-887301435
@westonpace I'd rather wait before someone requests it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-886689802
(rebased)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] westonpace commented on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-887016039
This could be a future JIRA or "wait until it's asked for" but technically ISO-8601 allows the omission of the colons if you add a leading T but I've never seen it in practice...
```
Thh
Thhmm
Thhmmss
Thhmmss.sss
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-885051096
https://issues.apache.org/jira/browse/ARROW-11243
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] ursabot edited a comment on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-885081199
Benchmark runs are scheduled for baseline = 169b249057cfe7dd2462efd43f6d1de7e23bb60e and contender = 25b5c1e445f04b53a0beb40afabb3e8012c31f50. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/bd5b0a0fb1b846aca7025545d099f860...960f1aae0f734b17a2377d7a17bdb144/)
[Finished :arrow_down:0.0% :arrow_up:1.53%] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/13fb9a396ecf4396b077794d29ed8a1c...f4c7a34e715646018a7084849de287a2/)
[Scheduled] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/f7459221ae474079b9926f692c70a5d6...e1b16111fec44c73bb4615987ae457b0/)
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#discussion_r676588138
##########
File path: cpp/src/arrow/csv/column_builder_test.cc
##########
@@ -400,6 +400,24 @@ TEST_F(InferringColumnBuilderTest, MultipleChunkDate) {
ArrayFromJSON(date32(), "[null]")});
}
+TEST_F(InferringColumnBuilderTest, SingleChunkTime) {
+ auto options = ConvertOptions::Defaults();
+ auto tg = TaskGroup::MakeSerial();
+
+ CheckInferred(tg, {{"", "01:23:45", "NA"}}, options,
Review comment:
I don't think so. Integers can not be parsed as a time field currently.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou edited a comment on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
pitrou edited a comment on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-887301435
@westonpace I'd rather wait for someone to request it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#discussion_r676588696
##########
File path: cpp/src/arrow/csv/column_builder_test.cc
##########
@@ -400,6 +400,24 @@ TEST_F(InferringColumnBuilderTest, MultipleChunkDate) {
ArrayFromJSON(date32(), "[null]")});
}
+TEST_F(InferringColumnBuilderTest, SingleChunkTime) {
+ auto options = ConvertOptions::Defaults();
+ auto tg = TaskGroup::MakeSerial();
+
+ CheckInferred(tg, {{"", "01:23:45", "NA"}}, options,
+ {ArrayFromJSON(time32(TimeUnit::SECOND), "[null, 5025, null]")});
+}
+
+TEST_F(InferringColumnBuilderTest, MultipleChunkTime) {
+ auto options = ConvertOptions::Defaults();
+ auto tg = TaskGroup::MakeSerial();
+ auto type = time32(TimeUnit::SECOND);
Review comment:
Currently, there is no inference to time64. Should there be one (for times with nanosecond precision perhaps)?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-885081199
Benchmark runs are scheduled for baseline = 169b249057cfe7dd2462efd43f6d1de7e23bb60e and contender = 25b5c1e445f04b53a0beb40afabb3e8012c31f50. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Scheduled] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/bd5b0a0fb1b846aca7025545d099f860...960f1aae0f734b17a2377d7a17bdb144/)
[Scheduled] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/13fb9a396ecf4396b077794d29ed8a1c...f4c7a34e715646018a7084849de287a2/)
[Scheduled] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/f7459221ae474079b9926f692c70a5d6...e1b16111fec44c73bb4615987ae457b0/)
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] ursabot edited a comment on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-885081199
Benchmark runs are scheduled for baseline = 169b249057cfe7dd2462efd43f6d1de7e23bb60e and contender = 25b5c1e445f04b53a0beb40afabb3e8012c31f50. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/bd5b0a0fb1b846aca7025545d099f860...960f1aae0f734b17a2377d7a17bdb144/)
[Scheduled] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/13fb9a396ecf4396b077794d29ed8a1c...f4c7a34e715646018a7084849de287a2/)
[Scheduled] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/f7459221ae474079b9926f692c70a5d6...e1b16111fec44c73bb4615987ae457b0/)
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nealrichardson commented on a change in pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#discussion_r676585650
##########
File path: cpp/src/arrow/csv/column_builder_test.cc
##########
@@ -400,6 +400,24 @@ TEST_F(InferringColumnBuilderTest, MultipleChunkDate) {
ArrayFromJSON(date32(), "[null]")});
}
+TEST_F(InferringColumnBuilderTest, SingleChunkTime) {
+ auto options = ConvertOptions::Defaults();
+ auto tg = TaskGroup::MakeSerial();
+
+ CheckInferred(tg, {{"", "01:23:45", "NA"}}, options,
+ {ArrayFromJSON(time32(TimeUnit::SECOND), "[null, 5025, null]")});
+}
+
+TEST_F(InferringColumnBuilderTest, MultipleChunkTime) {
+ auto options = ConvertOptions::Defaults();
+ auto tg = TaskGroup::MakeSerial();
+ auto type = time32(TimeUnit::SECOND);
Review comment:
Should there also be an inferring test that results in time64?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nealrichardson commented on a change in pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#discussion_r676596917
##########
File path: cpp/src/arrow/csv/column_builder_test.cc
##########
@@ -400,6 +400,24 @@ TEST_F(InferringColumnBuilderTest, MultipleChunkDate) {
ArrayFromJSON(date32(), "[null]")});
}
+TEST_F(InferringColumnBuilderTest, SingleChunkTime) {
+ auto options = ConvertOptions::Defaults();
+ auto tg = TaskGroup::MakeSerial();
+
+ CheckInferred(tg, {{"", "01:23:45", "NA"}}, options,
+ {ArrayFromJSON(time32(TimeUnit::SECOND), "[null, 5025, null]")});
+}
+
+TEST_F(InferringColumnBuilderTest, MultipleChunkTime) {
+ auto options = ConvertOptions::Defaults();
+ auto tg = TaskGroup::MakeSerial();
+ auto type = time32(TimeUnit::SECOND);
Review comment:
Ah I see, I was looking at the wrong case statement, you're right. Maybe there should be one to catch any times with fractional seconds, though I don't feel strongly about it since you can explicitly declare the type/schema you want.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#discussion_r675146941
##########
File path: cpp/src/arrow/csv/column_builder_test.cc
##########
@@ -400,6 +400,24 @@ TEST_F(InferringColumnBuilderTest, MultipleChunkDate) {
ArrayFromJSON(date32(), "[null]")});
}
+TEST_F(InferringColumnBuilderTest, SingleChunkTime) {
+ auto options = ConvertOptions::Defaults();
+ auto tg = TaskGroup::MakeSerial();
+
+ CheckInferred(tg, {{"", "01:23:45", "NA"}}, options,
Review comment:
Could you include an integer or date (which should still allow the column to be inferred as time)?
```suggestion
CheckInferred(tg, {{"", "99", "01:23:45", "NA"}}, options,
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-885080829
@ursabot please benchmark
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#discussion_r677211270
##########
File path: cpp/src/arrow/csv/column_builder_test.cc
##########
@@ -400,6 +400,24 @@ TEST_F(InferringColumnBuilderTest, MultipleChunkDate) {
ArrayFromJSON(date32(), "[null]")});
}
+TEST_F(InferringColumnBuilderTest, SingleChunkTime) {
+ auto options = ConvertOptions::Defaults();
+ auto tg = TaskGroup::MakeSerial();
+
+ CheckInferred(tg, {{"", "01:23:45", "NA"}}, options,
Review comment:
Will do.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#discussion_r676736686
##########
File path: cpp/src/arrow/csv/column_builder_test.cc
##########
@@ -400,6 +400,24 @@ TEST_F(InferringColumnBuilderTest, MultipleChunkDate) {
ArrayFromJSON(date32(), "[null]")});
}
+TEST_F(InferringColumnBuilderTest, SingleChunkTime) {
+ auto options = ConvertOptions::Defaults();
+ auto tg = TaskGroup::MakeSerial();
+
+ CheckInferred(tg, {{"", "01:23:45", "NA"}}, options,
Review comment:
I mean: I'm guessing that in the case of `"", "99", "01:23:45"` we'd fall back to string since we have an integer (definitely not a time, timestamp, ...) and a time (definitely not an integer, float, ...). That seems worth testing
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou closed pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10782:
URL: https://github.com/apache/arrow/pull/10782
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] ursabot edited a comment on pull request #10782: ARROW-11243: [C++] Recognize time types in CSV files
Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #10782:
URL: https://github.com/apache/arrow/pull/10782#issuecomment-885081199
Benchmark runs are scheduled for baseline = 169b249057cfe7dd2462efd43f6d1de7e23bb60e and contender = 25b5c1e445f04b53a0beb40afabb3e8012c31f50. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2 (mimalloc)](https://conbench.ursa.dev/compare/runs/bd5b0a0fb1b846aca7025545d099f860...960f1aae0f734b17a2377d7a17bdb144/)
[Finished :arrow_down:0.0% :arrow_up:1.53%] [ursa-i9-9960x (mimalloc)](https://conbench.ursa.dev/compare/runs/13fb9a396ecf4396b077794d29ed8a1c...f4c7a34e715646018a7084849de287a2/)
[Finished :arrow_down:0.91% :arrow_up:0.1%] [ursa-thinkcentre-m75q (mimalloc)](https://conbench.ursa.dev/compare/runs/f7459221ae474079b9926f692c70a5d6...e1b16111fec44c73bb4615987ae457b0/)
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org