You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/15 14:51:02 UTC

[GitHub] [arrow] n3world commented on a change in pull request #10505: ARROW-12995: [C++] Add validation to CSV options

n3world commented on a change in pull request #10505:
URL: https://github.com/apache/arrow/pull/10505#discussion_r651865764



##########
File path: cpp/src/arrow/csv/options.cc
##########
@@ -17,11 +17,32 @@
 
 #include "arrow/csv/options.h"
 
+#include <iomanip>
+
 namespace arrow {
 namespace csv {
 
 ParseOptions ParseOptions::Defaults() { return ParseOptions(); }
 
+Status ParseOptions::Validate() const {
+  if (ARROW_PREDICT_FALSE((delimiter < ' ' && delimiter != '\t') || delimiter > '~')) {
+    return Status::Invalid(
+        "ParseOptions: delimiter must be a printable ascii char or '\\t': 0x",
+        std::setfill('0'), std::setw(2), std::hex, static_cast<uint16_t>(delimiter));

Review comment:
       A lot of the characters before ` ` are terminal control characters and the only ascii character after `~` is `del`. While these characters could be used in theory for a text file they would be very strange couldn't be displayed on a terminal. These are the characters this is excluding https://flaviocopes.com/non-printable-ascii-characters/ . Also, I don't think the user could specify values like `\n` or `\r` since that would break the parser since they have to be used for end of line. Currently the python layer requires all character values to be between 1 - 127 inclusive.
   I kept tab because tab separated values is a format which is used.

##########
File path: python/pyarrow/_csv.pyx
##########
@@ -58,6 +58,7 @@ cdef class ReadOptions(_Weakrefable):
         How much bytes to process at a time from the input stream.
         This will determine multi-threading granularity as well as
         the size of individual record batches or table chunks.
+        Minimum valid value for block size is 1KB

Review comment:
       Ooops, I wanted 1KB but tripped over tests. Guess I missed a spot




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org