You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/04 19:49:30 UTC
[GitHub] [arrow] bkietz opened a new pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities
bkietz opened a new pull request #7901:
URL: https://github.com/apache/arrow/pull/7901
Replace `StringConverter` with `ParseValueTraits` and emphasize that the function `ParseValue` is the entrypoint for parsing primitive values.
Replace `StringFormatter` with `FormatValueTraits` and add a function `FormatValue` to act as the entrypoint for formatting primitive values.
A number of convenience overloads are provided for both.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] bkietz closed pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities
Posted by GitBox <gi...@apache.org>.
bkietz closed pull request #7901:
URL: https://github.com/apache/arrow/pull/7901
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nealrichardson commented on a change in pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities
Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#discussion_r508855487
##########
File path: cpp/src/arrow/json/converter.cc
##########
@@ -123,13 +123,11 @@ class NumericConverter : public PrimitiveConverter {
RETURN_NOT_OK(builder.Resize(dict_array.indices()->length()));
auto visit_valid = [&](string_view repr) {
- value_type value;
- if (!internal::ParseValue(numeric_type_, repr.data(), repr.size(), &value)) {
- return GenericConversionError(*out_type_, ", couldn't parse:", repr);
+ if (auto value = internal::ParseValue(numeric_type_, repr)) {
Review comment:
This is nice that you fixed this "maybe-undefined" `value`. There are two more like this in `csv/converter.cc`: can you fix those too? cf. https://github.com/r-windows/rtools-packages/runs/1277507717#step:4:893
(How difficult would it be to add a CI job that fails on `-Wmaybe-undefined`? I feel like every time I inspect the C++ build logs on the R mingw builds (which is infrequent), I find more warnings like this.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities
Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#discussion_r510399380
##########
File path: cpp/src/arrow/json/converter.cc
##########
@@ -123,13 +123,11 @@ class NumericConverter : public PrimitiveConverter {
RETURN_NOT_OK(builder.Resize(dict_array.indices()->length()));
auto visit_valid = [&](string_view repr) {
- value_type value;
- if (!internal::ParseValue(numeric_type_, repr.data(), repr.size(), &value)) {
- return GenericConversionError(*out_type_, ", couldn't parse:", repr);
+ if (auto value = internal::ParseValue(numeric_type_, repr)) {
Review comment:
I think these can be fixed, but I'd like to wait for a follow up so we don't conflate any perf impact of this PR with unrelated changes inside CSV
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities
Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#discussion_r465748964
##########
File path: cpp/src/arrow/util/formatting.h
##########
@@ -40,12 +40,51 @@ namespace arrow {
namespace internal {
/// \brief The entry point for conversion to strings.
+///
+/// Specializations of FormatValueTraits for `ARROW_TYPE` must define:
+/// - A member type `value_type` which will be formatted.
+/// - A static member function `MaxSize(const ARROW_TYPE& t)` indicating the
+/// maximum number of characters which formatting a `value_type` may yield.
+/// - The static member function `Convert`, callable with signature
Review comment:
I was following the convention of `ParseValueTraits`, which names the conjugate method `Convert`
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities
Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#discussion_r465734624
##########
File path: cpp/src/arrow/util/formatting.h
##########
@@ -40,12 +40,51 @@ namespace arrow {
namespace internal {
/// \brief The entry point for conversion to strings.
+///
+/// Specializations of FormatValueTraits for `ARROW_TYPE` must define:
+/// - A member type `value_type` which will be formatted.
+/// - A static member function `MaxSize(const ARROW_TYPE& t)` indicating the
+/// maximum number of characters which formatting a `value_type` may yield.
+/// - The static member function `Convert`, callable with signature
Review comment:
Why not call it `Format`?
##########
File path: cpp/src/arrow/util/formatting.cc
##########
@@ -36,38 +31,29 @@ const char digit_pairs[] =
"6061626364656667686970717273747576777879"
"8081828384858687888990919293949596979899";
-} // namespace detail
-
-struct FloatToStringFormatter::Impl {
- Impl()
- : converter_(DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan",
- 'e', -6, 10, 6, 0) {}
-
- DoubleToStringConverter converter_;
-};
-
-FloatToStringFormatter::FloatToStringFormatter() : impl_(new Impl()) {}
+using util::double_conversion::DoubleToStringConverter;
-FloatToStringFormatter::~FloatToStringFormatter() {}
+static DoubleToStringConverter g_double_converter(
+ DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan", 'e', -6, 10, 6,
+ 0);
-int FloatToStringFormatter::FormatFloat(float v, char* out_buffer, int out_size) {
- DCHECK_GE(out_size, kMinBufferSize);
+int FormatFloat(float v, size_t buffer_size, char* buffer) {
// StringBuilder checks bounds in debug mode for us
- util::double_conversion::StringBuilder builder(out_buffer, out_size);
- bool result = impl_->converter_.ToShortestSingle(v, &builder);
+ util::double_conversion::StringBuilder builder(buffer, static_cast<int>(buffer_size));
+ bool result = g_double_converter.ToShortestSingle(v, &builder);
DCHECK(result);
ARROW_UNUSED(result);
return builder.position();
}
-int FloatToStringFormatter::FormatFloat(double v, char* out_buffer, int out_size) {
- DCHECK_GE(out_size, kMinBufferSize);
- util::double_conversion::StringBuilder builder(out_buffer, out_size);
- bool result = impl_->converter_.ToShortest(v, &builder);
+int FormatFloat(double v, size_t buffer_size, char* buffer) {
+ util::double_conversion::StringBuilder builder(buffer, static_cast<int>(buffer_size));
+ bool result = g_double_converter.ToShortest(v, &builder);
DCHECK(result);
ARROW_UNUSED(result);
return builder.position();
}
+} // namespace detail
Review comment:
Why not use the anonymous namespace for private functions?
##########
File path: cpp/src/arrow/util/formatting.cc
##########
@@ -36,38 +31,29 @@ const char digit_pairs[] =
"6061626364656667686970717273747576777879"
"8081828384858687888990919293949596979899";
-} // namespace detail
-
-struct FloatToStringFormatter::Impl {
- Impl()
- : converter_(DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan",
- 'e', -6, 10, 6, 0) {}
-
- DoubleToStringConverter converter_;
-};
-
-FloatToStringFormatter::FloatToStringFormatter() : impl_(new Impl()) {}
+using util::double_conversion::DoubleToStringConverter;
-FloatToStringFormatter::~FloatToStringFormatter() {}
+static DoubleToStringConverter g_double_converter(
+ DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan", 'e', -6, 10, 6,
+ 0);
-int FloatToStringFormatter::FormatFloat(float v, char* out_buffer, int out_size) {
- DCHECK_GE(out_size, kMinBufferSize);
+int FormatFloat(float v, size_t buffer_size, char* buffer) {
Review comment:
It's weird to take a `size_t` buffer size and return a `int` for the number of bytes written.
##########
File path: cpp/src/arrow/compute/kernels/scalar_cast_internal.cc
##########
@@ -270,6 +272,30 @@ void AddCommonCasts(Type::type out_type_id, OutputType out_ty, CastFunction* fun
MemAllocation::NO_PREALLOCATE));
}
+// TODO(bkietz): provide a generic from-string parsing cast
Review comment:
TODOs should be kept in JIRA, not sprinkled around source code, IMHO.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities
Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#discussion_r465750819
##########
File path: cpp/src/arrow/util/formatting.cc
##########
@@ -36,38 +31,29 @@ const char digit_pairs[] =
"6061626364656667686970717273747576777879"
"8081828384858687888990919293949596979899";
-} // namespace detail
-
-struct FloatToStringFormatter::Impl {
- Impl()
- : converter_(DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan",
- 'e', -6, 10, 6, 0) {}
-
- DoubleToStringConverter converter_;
-};
-
-FloatToStringFormatter::FloatToStringFormatter() : impl_(new Impl()) {}
+using util::double_conversion::DoubleToStringConverter;
-FloatToStringFormatter::~FloatToStringFormatter() {}
+static DoubleToStringConverter g_double_converter(
+ DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan", 'e', -6, 10, 6,
+ 0);
-int FloatToStringFormatter::FormatFloat(float v, char* out_buffer, int out_size) {
- DCHECK_GE(out_size, kMinBufferSize);
+int FormatFloat(float v, size_t buffer_size, char* buffer) {
Review comment:
I'll make both `int`
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] bkietz commented on a change in pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities
Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#discussion_r465751971
##########
File path: cpp/src/arrow/util/formatting.cc
##########
@@ -36,38 +31,29 @@ const char digit_pairs[] =
"6061626364656667686970717273747576777879"
"8081828384858687888990919293949596979899";
-} // namespace detail
-
-struct FloatToStringFormatter::Impl {
- Impl()
- : converter_(DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan",
- 'e', -6, 10, 6, 0) {}
-
- DoubleToStringConverter converter_;
-};
-
-FloatToStringFormatter::FloatToStringFormatter() : impl_(new Impl()) {}
+using util::double_conversion::DoubleToStringConverter;
-FloatToStringFormatter::~FloatToStringFormatter() {}
+static DoubleToStringConverter g_double_converter(
+ DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan", 'e', -6, 10, 6,
+ 0);
-int FloatToStringFormatter::FormatFloat(float v, char* out_buffer, int out_size) {
- DCHECK_GE(out_size, kMinBufferSize);
+int FormatFloat(float v, size_t buffer_size, char* buffer) {
// StringBuilder checks bounds in debug mode for us
- util::double_conversion::StringBuilder builder(out_buffer, out_size);
- bool result = impl_->converter_.ToShortestSingle(v, &builder);
+ util::double_conversion::StringBuilder builder(buffer, static_cast<int>(buffer_size));
+ bool result = g_double_converter.ToShortestSingle(v, &builder);
DCHECK(result);
ARROW_UNUSED(result);
return builder.position();
}
-int FloatToStringFormatter::FormatFloat(double v, char* out_buffer, int out_size) {
- DCHECK_GE(out_size, kMinBufferSize);
- util::double_conversion::StringBuilder builder(out_buffer, out_size);
- bool result = impl_->converter_.ToShortest(v, &builder);
+int FormatFloat(double v, size_t buffer_size, char* buffer) {
+ util::double_conversion::StringBuilder builder(buffer, static_cast<int>(buffer_size));
+ bool result = g_double_converter.ToShortest(v, &builder);
DCHECK(result);
ARROW_UNUSED(result);
return builder.position();
}
+} // namespace detail
Review comment:
These aren't quite private; they're exported on line 220 of formatting.h so that they can be used by `FormatValueTraits<floating types>::Convert`
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] bkietz commented on pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities
Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#issuecomment-754108616
closing for now
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#issuecomment-668798574
https://issues.apache.org/jira/browse/ARROW-9543
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#issuecomment-669219582
It seems that micro-benchmark performance is regressing with this PR, at least here (clang 9.0, AMD Ryzen):
* before:
```
IntegerParsing<Int8Type> 2396 ns 2396 ns 292824 items_per_second=417.353M/s
IntegerParsing<Int16Type> 4499 ns 4499 ns 155249 items_per_second=222.277M/s
IntegerParsing<Int32Type> 8664 ns 8663 ns 80584 items_per_second=115.43M/s
IntegerParsing<Int64Type> 12173 ns 12172 ns 57294 items_per_second=82.158M/s
IntegerParsing<UInt8Type> 3369 ns 3369 ns 207573 items_per_second=296.845M/s
IntegerParsing<UInt16Type> 2102 ns 2102 ns 332411 items_per_second=475.832M/s
IntegerParsing<UInt32Type> 4962 ns 4961 ns 140831 items_per_second=201.572M/s
IntegerParsing<UInt64Type> 7191 ns 7190 ns 100415 items_per_second=139.078M/s
IntegerFormatting<Int8Type> 3812 ns 3812 ns 179090 items_per_second=262.339M/s
IntegerFormatting<Int16Type> 5027 ns 5027 ns 134540 items_per_second=198.932M/s
IntegerFormatting<Int32Type> 7389 ns 7388 ns 93547 items_per_second=135.355M/s
IntegerFormatting<Int64Type> 15756 ns 15754 ns 44352 items_per_second=63.4756M/s
IntegerFormatting<UInt8Type> 4056 ns 4056 ns 173893 items_per_second=246.56M/s
IntegerFormatting<UInt16Type> 5275 ns 5275 ns 132502 items_per_second=189.586M/s
IntegerFormatting<UInt32Type> 7541 ns 7540 ns 92745 items_per_second=132.63M/s
IntegerFormatting<UInt64Type> 17504 ns 17502 ns 40788 items_per_second=57.1355M/s
```
* after:
```
IntegerParsing<Int8Type> 4084 ns 4083 ns 169535 items_per_second=244.921M/s
IntegerParsing<Int16Type> 4707 ns 4707 ns 148239 items_per_second=212.468M/s
IntegerParsing<Int32Type> 8991 ns 8990 ns 77470 items_per_second=111.238M/s
IntegerParsing<Int64Type> 12632 ns 12630 ns 55277 items_per_second=79.1756M/s
IntegerParsing<UInt8Type> 3475 ns 3475 ns 201119 items_per_second=287.803M/s
IntegerParsing<UInt16Type> 3784 ns 3784 ns 184693 items_per_second=264.29M/s
IntegerParsing<UInt32Type> 5609 ns 5608 ns 124533 items_per_second=178.323M/s
IntegerParsing<UInt64Type> 7552 ns 7550 ns 92385 items_per_second=132.442M/s
IntegerFormatting<Int8Type> 5973 ns 5972 ns 115765 items_per_second=167.445M/s
IntegerFormatting<Int16Type> 7325 ns 7324 ns 95107 items_per_second=136.541M/s
IntegerFormatting<Int32Type> 15128 ns 15126 ns 46259 items_per_second=66.1119M/s
IntegerFormatting<Int64Type> 18152 ns 18149 ns 43116 items_per_second=55.0988M/s
IntegerFormatting<UInt8Type> 6079 ns 6078 ns 114820 items_per_second=164.531M/s
IntegerFormatting<UInt16Type> 8158 ns 8157 ns 85571 items_per_second=122.597M/s
IntegerFormatting<UInt32Type> 10500 ns 10498 ns 66770 items_per_second=95.2558M/s
IntegerFormatting<UInt64Type> 20176 ns 20173 ns 34577 items_per_second=49.5716M/s
```
However, CSV conversion benchmarks seem either affected or (slightly) improved:
* before:
```
Int64Conversion 84572 ns 84558 ns 8320 items_per_second=98.5477M/s
FloatConversion 186815 ns 186786 ns 3770 items_per_second=42.8298M/s
Decimal128Conversion 403464 ns 403407 ns 1715 items_per_second=16.5267M/s
StringConversion 58231 ns 58222 ns 12144 items_per_second=114.509M/s
TimestampConversionDefault 156438 ns 156414 ns 4461 items_per_second=63.9328M/s
TimestampConversionStrptime 472131 ns 472065 ns 1480 items_per_second=21.1835M/s
```
* after:
```
Int64Conversion 70332 ns 70322 ns 9953 items_per_second=118.497M/s
FloatConversion 179511 ns 179482 ns 3823 items_per_second=44.5726M/s
Decimal128Conversion 417192 ns 417133 ns 1678 items_per_second=15.9829M/s
StringConversion 58438 ns 58426 ns 12043 items_per_second=114.109M/s
TimestampConversionDefault 154926 ns 154902 ns 4495 items_per_second=64.5568M/s
TimestampConversionStrptime 514370 ns 514289 ns 1345 items_per_second=19.4443M/s
```
So there doesn't seem to be any reason for concern, a priori.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities
Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#discussion_r465753604
##########
File path: cpp/src/arrow/util/formatting.h
##########
@@ -40,12 +40,51 @@ namespace arrow {
namespace internal {
/// \brief The entry point for conversion to strings.
+///
+/// Specializations of FormatValueTraits for `ARROW_TYPE` must define:
+/// - A member type `value_type` which will be formatted.
+/// - A static member function `MaxSize(const ARROW_TYPE& t)` indicating the
+/// maximum number of characters which formatting a `value_type` may yield.
+/// - The static member function `Convert`, callable with signature
Review comment:
I see, fair enough.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org