Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/15 13:11:36 UTC

[GitHub] [arrow] pitrou commented on a change in pull request #10505: ARROW-12995: [C++] Add validation to CSV options

pitrou commented on a change in pull request #10505:
URL: https://github.com/apache/arrow/pull/10505#discussion_r651773526



##########
File path: cpp/src/arrow/csv/options.cc
##########
@@ -17,11 +17,32 @@
 
 #include "arrow/csv/options.h"
 
+#include <iomanip>
+
 namespace arrow {
 namespace csv {
 
 ParseOptions ParseOptions::Defaults() { return ParseOptions(); }
 
+Status ParseOptions::Validate() const {
+  if (ARROW_PREDICT_FALSE((delimiter < ' ' && delimiter != '\t') || delimiter > '~')) {
+    return Status::Invalid(
+        "ParseOptions: delimiter must be a printable ascii char or '\\t': 0x",
+        std::setfill('0'), std::setw(2), std::hex, static_cast<uint16_t>(delimiter));

Review comment:
       Hmm, I don't think I understand this check. Is there something in the CSV parser that currently prevents using other delimiters?
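
For reference, here is a minimal sketch of which delimiters the condition in this hunk would reject. `IsRejectedDelimiter` is a hypothetical helper written only for illustration, not an Arrow function; it copies the predicate from the diff and says nothing about what the CSV parser itself can handle.

```cpp
#include <cassert>

// Hypothetical helper (not part of Arrow): returns true when the predicate
// from the proposed ParseOptions::Validate() would reject the delimiter.
bool IsRejectedDelimiter(char delimiter) {
  // Same condition as in the diff: anything below ' ' except '\t',
  // or anything above '~', is treated as invalid.
  return (delimiter < ' ' && delimiter != '\t') || delimiter > '~';
}
```

Under this predicate, ',', '|' and '\t' pass, while control characters such as 0x01 and DEL (0x7f) are rejected, as are bytes at or above 0x80 whether `char` is signed or unsigned.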

##########
File path: cpp/src/arrow/csv/options.cc
##########
@@ -17,11 +17,32 @@
 
 #include "arrow/csv/options.h"
 
+#include <iomanip>
+
 namespace arrow {
 namespace csv {
 
 ParseOptions ParseOptions::Defaults() { return ParseOptions(); }
 
+Status ParseOptions::Validate() const {
+  if (ARROW_PREDICT_FALSE((delimiter < ' ' && delimiter != '\t') || delimiter > '~')) {
+    return Status::Invalid(
+        "ParseOptions: delimiter must be a printable ascii char or '\\t': 0x",
+        std::setfill('0'), std::setw(2), std::hex, static_cast<uint16_t>(delimiter));
+  }
+  if (ARROW_PREDICT_FALSE(quoting && (quote_char < ' ' || quote_char > '~'))) {

Review comment:
       Same question for this check on quote_char.

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -58,6 +58,7 @@ cdef class ReadOptions(_Weakrefable):
         How much bytes to process at a time from the input stream.
         This will determine multi-threading granularity as well as
         the size of individual record batches or table chunks.
+        Minimum valid value for block size is 1KB

Review comment:
       You mean 1B, no?



