You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/04 19:49:30 UTC

[GitHub] [arrow] bkietz opened a new pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities

bkietz opened a new pull request #7901:
URL: https://github.com/apache/arrow/pull/7901


   Replace `StringConverter` with `ParseValueTraits` and emphasize that the function `ParseValue` is the entrypoint for parsing primitive values.
   
   Replace `StringFormatter` with `FormatValueTraits` and add a function `FormatValue` to act as the entrypoint for formatting primitive values.
   
   A number of convenience overloads are provided for both.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz closed pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities

Posted by GitBox <gi...@apache.org>.
bkietz closed pull request #7901:
URL: https://github.com/apache/arrow/pull/7901


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#discussion_r508855487



##########
File path: cpp/src/arrow/json/converter.cc
##########
@@ -123,13 +123,11 @@ class NumericConverter : public PrimitiveConverter {
     RETURN_NOT_OK(builder.Resize(dict_array.indices()->length()));
 
     auto visit_valid = [&](string_view repr) {
-      value_type value;
-      if (!internal::ParseValue(numeric_type_, repr.data(), repr.size(), &value)) {
-        return GenericConversionError(*out_type_, ", couldn't parse:", repr);
+      if (auto value = internal::ParseValue(numeric_type_, repr)) {

Review comment:
       This is nice that you fixed this "maybe-undefined" `value`. There are two more like this in `csv/converter.cc`: can you fix those too? cf. https://github.com/r-windows/rtools-packages/runs/1277507717#step:4:893
   
   (How difficult would it be to add a CI job that fails on `-Wmaybe-undefined`? I feel like every time I inspect the C++ build logs on the R mingw builds (which is infrequent), I find more warnings like this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#discussion_r510399380



##########
File path: cpp/src/arrow/json/converter.cc
##########
@@ -123,13 +123,11 @@ class NumericConverter : public PrimitiveConverter {
     RETURN_NOT_OK(builder.Resize(dict_array.indices()->length()));
 
     auto visit_valid = [&](string_view repr) {
-      value_type value;
-      if (!internal::ParseValue(numeric_type_, repr.data(), repr.size(), &value)) {
-        return GenericConversionError(*out_type_, ", couldn't parse:", repr);
+      if (auto value = internal::ParseValue(numeric_type_, repr)) {

Review comment:
       I think these can be fixed, but I'd like to wait for a follow up so we don't conflate any perf impact of this PR with unrelated changes inside CSV




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#discussion_r465748964



##########
File path: cpp/src/arrow/util/formatting.h
##########
@@ -40,12 +40,51 @@ namespace arrow {
 namespace internal {
 
 /// \brief The entry point for conversion to strings.
+///
+/// Specializations of FormatValueTraits for `ARROW_TYPE` must define:
+/// - A member type `value_type` which will be formatted.
+/// - A static member function `MaxSize(const ARROW_TYPE& t)` indicating the
+///   maximum number of characters which formatting a `value_type` may yield.
+/// - The static member function `Convert`, callable with signature

Review comment:
       I was following the convention of `ParseValueTraits`, which names the conjugate method `Convert`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#discussion_r465734624



##########
File path: cpp/src/arrow/util/formatting.h
##########
@@ -40,12 +40,51 @@ namespace arrow {
 namespace internal {
 
 /// \brief The entry point for conversion to strings.
+///
+/// Specializations of FormatValueTraits for `ARROW_TYPE` must define:
+/// - A member type `value_type` which will be formatted.
+/// - A static member function `MaxSize(const ARROW_TYPE& t)` indicating the
+///   maximum number of characters which formatting a `value_type` may yield.
+/// - The static member function `Convert`, callable with signature

Review comment:
       Why not call it `Format`?

##########
File path: cpp/src/arrow/util/formatting.cc
##########
@@ -36,38 +31,29 @@ const char digit_pairs[] =
     "6061626364656667686970717273747576777879"
     "8081828384858687888990919293949596979899";
 
-}  // namespace detail
-
-struct FloatToStringFormatter::Impl {
-  Impl()
-      : converter_(DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan",
-                   'e', -6, 10, 6, 0) {}
-
-  DoubleToStringConverter converter_;
-};
-
-FloatToStringFormatter::FloatToStringFormatter() : impl_(new Impl()) {}
+using util::double_conversion::DoubleToStringConverter;
 
-FloatToStringFormatter::~FloatToStringFormatter() {}
+static DoubleToStringConverter g_double_converter(
+    DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan", 'e', -6, 10, 6,
+    0);
 
-int FloatToStringFormatter::FormatFloat(float v, char* out_buffer, int out_size) {
-  DCHECK_GE(out_size, kMinBufferSize);
+int FormatFloat(float v, size_t buffer_size, char* buffer) {
   // StringBuilder checks bounds in debug mode for us
-  util::double_conversion::StringBuilder builder(out_buffer, out_size);
-  bool result = impl_->converter_.ToShortestSingle(v, &builder);
+  util::double_conversion::StringBuilder builder(buffer, static_cast<int>(buffer_size));
+  bool result = g_double_converter.ToShortestSingle(v, &builder);
   DCHECK(result);
   ARROW_UNUSED(result);
   return builder.position();
 }
 
-int FloatToStringFormatter::FormatFloat(double v, char* out_buffer, int out_size) {
-  DCHECK_GE(out_size, kMinBufferSize);
-  util::double_conversion::StringBuilder builder(out_buffer, out_size);
-  bool result = impl_->converter_.ToShortest(v, &builder);
+int FormatFloat(double v, size_t buffer_size, char* buffer) {
+  util::double_conversion::StringBuilder builder(buffer, static_cast<int>(buffer_size));
+  bool result = g_double_converter.ToShortest(v, &builder);
   DCHECK(result);
   ARROW_UNUSED(result);
   return builder.position();
 }
 
+}  // namespace detail

Review comment:
       Why not use the anonymous namespace for private functions?

##########
File path: cpp/src/arrow/util/formatting.cc
##########
@@ -36,38 +31,29 @@ const char digit_pairs[] =
     "6061626364656667686970717273747576777879"
     "8081828384858687888990919293949596979899";
 
-}  // namespace detail
-
-struct FloatToStringFormatter::Impl {
-  Impl()
-      : converter_(DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan",
-                   'e', -6, 10, 6, 0) {}
-
-  DoubleToStringConverter converter_;
-};
-
-FloatToStringFormatter::FloatToStringFormatter() : impl_(new Impl()) {}
+using util::double_conversion::DoubleToStringConverter;
 
-FloatToStringFormatter::~FloatToStringFormatter() {}
+static DoubleToStringConverter g_double_converter(
+    DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan", 'e', -6, 10, 6,
+    0);
 
-int FloatToStringFormatter::FormatFloat(float v, char* out_buffer, int out_size) {
-  DCHECK_GE(out_size, kMinBufferSize);
+int FormatFloat(float v, size_t buffer_size, char* buffer) {

Review comment:
       It's weird to take a `size_t` buffer size and return a `int` for the number of bytes written.

##########
File path: cpp/src/arrow/compute/kernels/scalar_cast_internal.cc
##########
@@ -270,6 +272,30 @@ void AddCommonCasts(Type::type out_type_id, OutputType out_ty, CastFunction* fun
                             MemAllocation::NO_PREALLOCATE));
 }
 
+// TODO(bkietz): provide a generic from-string parsing cast

Review comment:
       TODOs should be kept in JIRA, not sprinkled around source code, IMHO.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#discussion_r465750819



##########
File path: cpp/src/arrow/util/formatting.cc
##########
@@ -36,38 +31,29 @@ const char digit_pairs[] =
     "6061626364656667686970717273747576777879"
     "8081828384858687888990919293949596979899";
 
-}  // namespace detail
-
-struct FloatToStringFormatter::Impl {
-  Impl()
-      : converter_(DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan",
-                   'e', -6, 10, 6, 0) {}
-
-  DoubleToStringConverter converter_;
-};
-
-FloatToStringFormatter::FloatToStringFormatter() : impl_(new Impl()) {}
+using util::double_conversion::DoubleToStringConverter;
 
-FloatToStringFormatter::~FloatToStringFormatter() {}
+static DoubleToStringConverter g_double_converter(
+    DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan", 'e', -6, 10, 6,
+    0);
 
-int FloatToStringFormatter::FormatFloat(float v, char* out_buffer, int out_size) {
-  DCHECK_GE(out_size, kMinBufferSize);
+int FormatFloat(float v, size_t buffer_size, char* buffer) {

Review comment:
       I'll make both `int`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#discussion_r465751971



##########
File path: cpp/src/arrow/util/formatting.cc
##########
@@ -36,38 +31,29 @@ const char digit_pairs[] =
     "6061626364656667686970717273747576777879"
     "8081828384858687888990919293949596979899";
 
-}  // namespace detail
-
-struct FloatToStringFormatter::Impl {
-  Impl()
-      : converter_(DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan",
-                   'e', -6, 10, 6, 0) {}
-
-  DoubleToStringConverter converter_;
-};
-
-FloatToStringFormatter::FloatToStringFormatter() : impl_(new Impl()) {}
+using util::double_conversion::DoubleToStringConverter;
 
-FloatToStringFormatter::~FloatToStringFormatter() {}
+static DoubleToStringConverter g_double_converter(
+    DoubleToStringConverter::EMIT_POSITIVE_EXPONENT_SIGN, "inf", "nan", 'e', -6, 10, 6,
+    0);
 
-int FloatToStringFormatter::FormatFloat(float v, char* out_buffer, int out_size) {
-  DCHECK_GE(out_size, kMinBufferSize);
+int FormatFloat(float v, size_t buffer_size, char* buffer) {
   // StringBuilder checks bounds in debug mode for us
-  util::double_conversion::StringBuilder builder(out_buffer, out_size);
-  bool result = impl_->converter_.ToShortestSingle(v, &builder);
+  util::double_conversion::StringBuilder builder(buffer, static_cast<int>(buffer_size));
+  bool result = g_double_converter.ToShortestSingle(v, &builder);
   DCHECK(result);
   ARROW_UNUSED(result);
   return builder.position();
 }
 
-int FloatToStringFormatter::FormatFloat(double v, char* out_buffer, int out_size) {
-  DCHECK_GE(out_size, kMinBufferSize);
-  util::double_conversion::StringBuilder builder(out_buffer, out_size);
-  bool result = impl_->converter_.ToShortest(v, &builder);
+int FormatFloat(double v, size_t buffer_size, char* buffer) {
+  util::double_conversion::StringBuilder builder(buffer, static_cast<int>(buffer_size));
+  bool result = g_double_converter.ToShortest(v, &builder);
   DCHECK(result);
   ARROW_UNUSED(result);
   return builder.position();
 }
 
+}  // namespace detail

Review comment:
       These aren't quite private; they're exported on line 220 of formatting.h so that they can be used by `FormatValueTraits<floating types>::Convert`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#issuecomment-754108616


   closing for now


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#issuecomment-668798574


   https://issues.apache.org/jira/browse/ARROW-9543


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#issuecomment-669219582


   It seems that micro-benchmark performance is regressing with this PR, at least here (clang 9.0, AMD Ryzen):
   * before:
   ```
   IntegerParsing<Int8Type>                        2396 ns         2396 ns       292824 items_per_second=417.353M/s
   IntegerParsing<Int16Type>                       4499 ns         4499 ns       155249 items_per_second=222.277M/s
   IntegerParsing<Int32Type>                       8664 ns         8663 ns        80584 items_per_second=115.43M/s
   IntegerParsing<Int64Type>                      12173 ns        12172 ns        57294 items_per_second=82.158M/s
   IntegerParsing<UInt8Type>                       3369 ns         3369 ns       207573 items_per_second=296.845M/s
   IntegerParsing<UInt16Type>                      2102 ns         2102 ns       332411 items_per_second=475.832M/s
   IntegerParsing<UInt32Type>                      4962 ns         4961 ns       140831 items_per_second=201.572M/s
   IntegerParsing<UInt64Type>                      7191 ns         7190 ns       100415 items_per_second=139.078M/s
   
   IntegerFormatting<Int8Type>                     3812 ns         3812 ns       179090 items_per_second=262.339M/s
   IntegerFormatting<Int16Type>                    5027 ns         5027 ns       134540 items_per_second=198.932M/s
   IntegerFormatting<Int32Type>                    7389 ns         7388 ns        93547 items_per_second=135.355M/s
   IntegerFormatting<Int64Type>                   15756 ns        15754 ns        44352 items_per_second=63.4756M/s
   IntegerFormatting<UInt8Type>                    4056 ns         4056 ns       173893 items_per_second=246.56M/s
   IntegerFormatting<UInt16Type>                   5275 ns         5275 ns       132502 items_per_second=189.586M/s
   IntegerFormatting<UInt32Type>                   7541 ns         7540 ns        92745 items_per_second=132.63M/s
   IntegerFormatting<UInt64Type>                  17504 ns        17502 ns        40788 items_per_second=57.1355M/s
   ```
   * after:
   ```
   IntegerParsing<Int8Type>                        4084 ns         4083 ns       169535 items_per_second=244.921M/s
   IntegerParsing<Int16Type>                       4707 ns         4707 ns       148239 items_per_second=212.468M/s
   IntegerParsing<Int32Type>                       8991 ns         8990 ns        77470 items_per_second=111.238M/s
   IntegerParsing<Int64Type>                      12632 ns        12630 ns        55277 items_per_second=79.1756M/s
   IntegerParsing<UInt8Type>                       3475 ns         3475 ns       201119 items_per_second=287.803M/s
   IntegerParsing<UInt16Type>                      3784 ns         3784 ns       184693 items_per_second=264.29M/s
   IntegerParsing<UInt32Type>                      5609 ns         5608 ns       124533 items_per_second=178.323M/s
   IntegerParsing<UInt64Type>                      7552 ns         7550 ns        92385 items_per_second=132.442M/s
   
   IntegerFormatting<Int8Type>                     5973 ns         5972 ns       115765 items_per_second=167.445M/s
   IntegerFormatting<Int16Type>                    7325 ns         7324 ns        95107 items_per_second=136.541M/s
   IntegerFormatting<Int32Type>                   15128 ns        15126 ns        46259 items_per_second=66.1119M/s
   IntegerFormatting<Int64Type>                   18152 ns        18149 ns        43116 items_per_second=55.0988M/s
   IntegerFormatting<UInt8Type>                    6079 ns         6078 ns       114820 items_per_second=164.531M/s
   IntegerFormatting<UInt16Type>                   8158 ns         8157 ns        85571 items_per_second=122.597M/s
   IntegerFormatting<UInt32Type>                  10500 ns        10498 ns        66770 items_per_second=95.2558M/s
   IntegerFormatting<UInt64Type>                  20176 ns        20173 ns        34577 items_per_second=49.5716M/s
   ```
   
   However, CSV conversion benchmarks seem either affected or (slightly) improved:
   * before:
   ```
   Int64Conversion                  84572 ns        84558 ns         8320 items_per_second=98.5477M/s
   FloatConversion                 186815 ns       186786 ns         3770 items_per_second=42.8298M/s
   Decimal128Conversion            403464 ns       403407 ns         1715 items_per_second=16.5267M/s
   StringConversion                 58231 ns        58222 ns        12144 items_per_second=114.509M/s
   TimestampConversionDefault      156438 ns       156414 ns         4461 items_per_second=63.9328M/s
   TimestampConversionStrptime     472131 ns       472065 ns         1480 items_per_second=21.1835M/s
   ```
   * after:
   ```
   Int64Conversion                  70332 ns        70322 ns         9953 items_per_second=118.497M/s
   FloatConversion                 179511 ns       179482 ns         3823 items_per_second=44.5726M/s
   Decimal128Conversion            417192 ns       417133 ns         1678 items_per_second=15.9829M/s
   StringConversion                 58438 ns        58426 ns        12043 items_per_second=114.109M/s
   TimestampConversionDefault      154926 ns       154902 ns         4495 items_per_second=64.5568M/s
   TimestampConversionStrptime     514370 ns       514289 ns         1345 items_per_second=19.4443M/s
   ```
   
   So there doesn't seem to be any reason for concern, a priori.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7901: ARROW-9543: [C++] Simplify parsing/formatting utilities

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7901:
URL: https://github.com/apache/arrow/pull/7901#discussion_r465753604



##########
File path: cpp/src/arrow/util/formatting.h
##########
@@ -40,12 +40,51 @@ namespace arrow {
 namespace internal {
 
 /// \brief The entry point for conversion to strings.
+///
+/// Specializations of FormatValueTraits for `ARROW_TYPE` must define:
+/// - A member type `value_type` which will be formatted.
+/// - A static member function `MaxSize(const ARROW_TYPE& t)` indicating the
+///   maximum number of characters which formatting a `value_type` may yield.
+/// - The static member function `Convert`, callable with signature

Review comment:
       I see, fair enough.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org