You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/02 12:06:44 UTC

[GitHub] [arrow] johanpel opened a new pull request #11836: ARROW-14903 [C++] Enable CSV Writer to control string to be used for missing data

johanpel opened a new pull request #11836:
URL: https://github.com/apache/arrow/pull/11836


   This adds the option to configure the string that is rendered in the CSV output for null values.
   If this null string has any quotes, they are escaped. I decided not to deal with the fact that the user can provide a string that cause rendered values to become ambiguous. E.g. when the user supplies `42` and an integer column has a null, it will render as `42` making it indistinguishable from the actual value of 42.
   The reason is that if we were to quote a valid value or the null value, this may annoy downstream applications that perform e.g. type inference for said column. But the most major drawback is that we'd need to check every valid value for equality to the configured null value, which I think may be a price too costly to pay for users making awkward choices for their null value.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761095033



##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2
+
   return std::vector<WriterTestParams>{
-      {abc_schema, "[]", DefaultTestOptions(/*header=*/false), ""},
-      {abc_schema, "[]", DefaultTestOptions(/*header=*/true), expected_header},
-      {abc_schema, populated_batch, DefaultTestOptions(/*header=*/false),
+      {abc_schema, "[]", DefaultTestOptions(/*include_header=*/false, /*null_string=*/""),
+       ""},
+      {abc_schema, "[]", DefaultTestOptions(/*include_header=*/true, /*null_string=*/""),
+       expected_header},
+      {abc_schema, populated_batch,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/""),
        expected_without_header},
-      {abc_schema, populated_batch, DefaultTestOptions(/*header=*/true),
-       expected_header + expected_without_header}};
+      {abc_schema, populated_batch,
+       DefaultTestOptions(/*include_header=*/true, /*null_string=*/""),
+       expected_header + expected_without_header},
+      {schema_custom_na, populated_batch_custom_na,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/"NA"),
+       expected_custom_na},
+      {schema_custom_na, populated_batch_custom_na,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/R"("NA")"),
+       expected_custom_quoted_na}};

Review comment:
       I'm having trouble getting this one to round-trip. With double quotes (though as Antoine points out this is wrong):
   
   ```
   >>> csv.read_csv("test2.csv", convert_options=csv.ConvertOptions(strings_can_be_null=True, quoted_strings_can_be_null=False, null_values=['"NA"'], column_types={"g": pyarrow.int64()}))
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/_csv.pyx", line 886, in pyarrow._csv.read_csv
     File "pyarrow/_csv.pyx", line 895, in pyarrow._csv.read_csv
     File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: In CSV column #0: CSV conversion error to int64: invalid value 'NA""'
   ```
   
   With triple quotes:
   
   ```
   >>> csv.read_csv("test2.csv", convert_options=csv.ConvertOptions(strings_can_be_null=True, quoted_strings_can_be_null=False, null_values=['"NA"'], column_types={"g": pyarrow.int64()}))
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/_csv.pyx", line 886, in pyarrow._csv.read_csv
     File "pyarrow/_csv.pyx", line 895, in pyarrow._csv.read_csv
     File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: In CSV column #0: CSV conversion error to int64: invalid value '"NA"'
   ```
   
   Might be an issue on the read side. I don't think it's an issue to be addressed here.

##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2
+
   return std::vector<WriterTestParams>{
-      {abc_schema, "[]", DefaultTestOptions(/*header=*/false), ""},
-      {abc_schema, "[]", DefaultTestOptions(/*header=*/true), expected_header},
-      {abc_schema, populated_batch, DefaultTestOptions(/*header=*/false),
+      {abc_schema, "[]", DefaultTestOptions(/*include_header=*/false, /*null_string=*/""),
+       ""},
+      {abc_schema, "[]", DefaultTestOptions(/*include_header=*/true, /*null_string=*/""),
+       expected_header},
+      {abc_schema, populated_batch,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/""),
        expected_without_header},
-      {abc_schema, populated_batch, DefaultTestOptions(/*header=*/true),
-       expected_header + expected_without_header}};
+      {abc_schema, populated_batch,
+       DefaultTestOptions(/*include_header=*/true, /*null_string=*/""),
+       expected_header + expected_without_header},
+      {schema_custom_na, populated_batch_custom_na,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/"NA"),
+       expected_custom_na},

Review comment:
       This round-trips with `csv.ConvertOptions(strings_can_be_null=True, quoted_strings_can_be_null=False)` which is to be expected.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] johanpel commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
johanpel commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761105171



##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2
+
   return std::vector<WriterTestParams>{
-      {abc_schema, "[]", DefaultTestOptions(/*header=*/false), ""},
-      {abc_schema, "[]", DefaultTestOptions(/*header=*/true), expected_header},
-      {abc_schema, populated_batch, DefaultTestOptions(/*header=*/false),
+      {abc_schema, "[]", DefaultTestOptions(/*include_header=*/false, /*null_string=*/""),
+       ""},
+      {abc_schema, "[]", DefaultTestOptions(/*include_header=*/true, /*null_string=*/""),
+       expected_header},
+      {abc_schema, populated_batch,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/""),
        expected_without_header},
-      {abc_schema, populated_batch, DefaultTestOptions(/*header=*/true),
-       expected_header + expected_without_header}};
+      {abc_schema, populated_batch,
+       DefaultTestOptions(/*include_header=*/true, /*null_string=*/""),
+       expected_header + expected_without_header},
+      {schema_custom_na, populated_batch_custom_na,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/"NA"),
+       expected_custom_na},
+      {schema_custom_na, populated_batch_custom_na,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/R"("NA")"),
+       expected_custom_quoted_na}};

Review comment:
       If for some reason the user uses `"NA"` (including quotes, so four characters) as the null value, whatever reads it would need to understand that after removing the escapes this does not represent the string 'NA' (two chars), but would need to parse `"NA"` as a null.
   
   If we actually want to prevent users from making such a mistake we should change the CSV write option struct to have an API for the null value string that checks it for quotes and rejects it if it does.
   
   The point is if you specify something weird as a null value for the writer, but not for the reader, the latter will not recognize it as a null.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] johanpel commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
johanpel commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761097732



##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2

Review comment:
       This test case is for when a user supplies `"NA"` as the null value string.
   The idea is that all quotes in the null value string are escaped before rendering it for null values. This way it won't be confused with a rendered valid value string "NA". Hope this makes sense?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761098984



##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2

Review comment:
       Right. What I mean is that the result here seems wrong (in `expected_custom_quoted_na`). 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #11836: ARROW-14903 [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#issuecomment-984567491






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] johanpel commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
johanpel commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761115810



##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2

Review comment:
       So the point is that if the value that represents null is `"NA"`, if we would render it as `"""NA"""`, this would imply this is a string (because of the outer quotes) and the string value is obtained after unescaping the inner quotes, which would be would be `"NA"` (:string). But is not a string, it's `"NA"` (:null).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm closed pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
lidavidm closed pull request #11836:
URL: https://github.com/apache/arrow/pull/11836


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761098989



##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2
+
   return std::vector<WriterTestParams>{
-      {abc_schema, "[]", DefaultTestOptions(/*header=*/false), ""},
-      {abc_schema, "[]", DefaultTestOptions(/*header=*/true), expected_header},
-      {abc_schema, populated_batch, DefaultTestOptions(/*header=*/false),
+      {abc_schema, "[]", DefaultTestOptions(/*include_header=*/false, /*null_string=*/""),
+       ""},
+      {abc_schema, "[]", DefaultTestOptions(/*include_header=*/true, /*null_string=*/""),
+       expected_header},
+      {abc_schema, populated_batch,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/""),
        expected_without_header},
-      {abc_schema, populated_batch, DefaultTestOptions(/*header=*/true),
-       expected_header + expected_without_header}};
+      {abc_schema, populated_batch,
+       DefaultTestOptions(/*include_header=*/true, /*null_string=*/""),
+       expected_header + expected_without_header},
+      {schema_custom_na, populated_batch_custom_na,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/"NA"),
+       expected_custom_na},
+      {schema_custom_na, populated_batch_custom_na,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/R"("NA")"),
+       expected_custom_quoted_na}};

Review comment:
       Pandas can handle it:
   
   ```
   >>> pd.read_csv("test2.csv", dtype={"g": np.float64}, na_values=['"NA"'], keep_default_na=False)
         g
   0  42.0
   1   NaN
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#issuecomment-986163877


   Benchmark runs are scheduled for baseline = eee9df5b4f1631f3b4e6a2edacc98e850e9acaf2 and contender = 7240420a38d4b12317d54fc5195ddcaa5f35cf61. 7240420a38d4b12317d54fc5195ddcaa5f35cf61 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/d26af69075c442ad9d75f9a6417198a2...db0d04dc371747b496c4cfec6b0caa18/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/150e5b036abe49edbeef76e507320bf7...19c63885bfad46128daeb42336fad1f9/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/9af13d1bb1ff43499e3114a07baef591...1c1cd2b6a4124e1d8585e5807de26cf8/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#issuecomment-986163877


   Benchmark runs are scheduled for baseline = eee9df5b4f1631f3b4e6a2edacc98e850e9acaf2 and contender = 7240420a38d4b12317d54fc5195ddcaa5f35cf61. 7240420a38d4b12317d54fc5195ddcaa5f35cf61 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/d26af69075c442ad9d75f9a6417198a2...db0d04dc371747b496c4cfec6b0caa18/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/150e5b036abe49edbeef76e507320bf7...19c63885bfad46128daeb42336fad1f9/)
   [Finished :arrow_down:0.35% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/9af13d1bb1ff43499e3114a07baef591...1c1cd2b6a4124e1d8585e5807de26cf8/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#issuecomment-985940779


   Benchmark runs are scheduled for baseline = eee9df5b4f1631f3b4e6a2edacc98e850e9acaf2 and contender = 7240420a38d4b12317d54fc5195ddcaa5f35cf61. 7240420a38d4b12317d54fc5195ddcaa5f35cf61 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/448de220117e4e4a9140c855c0c427ef...a8ad475a03074138a77ca49eda0ecc9d/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/5b7bf75ccd0b4b1c8280f6b1a6de34eb...0e86d127e6254d9592624523b984db81/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/c4efb338d4224d90a2d3173218a5e65f...42290700f4cb48a999354a4a29ba9dbd/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761106420



##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2
+
   return std::vector<WriterTestParams>{
-      {abc_schema, "[]", DefaultTestOptions(/*header=*/false), ""},
-      {abc_schema, "[]", DefaultTestOptions(/*header=*/true), expected_header},
-      {abc_schema, populated_batch, DefaultTestOptions(/*header=*/false),
+      {abc_schema, "[]", DefaultTestOptions(/*include_header=*/false, /*null_string=*/""),
+       ""},
+      {abc_schema, "[]", DefaultTestOptions(/*include_header=*/true, /*null_string=*/""),
+       expected_header},
+      {abc_schema, populated_batch,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/""),
        expected_without_header},
-      {abc_schema, populated_batch, DefaultTestOptions(/*header=*/true),
-       expected_header + expected_without_header}};
+      {abc_schema, populated_batch,
+       DefaultTestOptions(/*include_header=*/true, /*null_string=*/""),
+       expected_header + expected_without_header},
+      {schema_custom_na, populated_batch_custom_na,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/"NA"),
+       expected_custom_na},
+      {schema_custom_na, populated_batch_custom_na,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/R"("NA")"),
+       expected_custom_quoted_na}};

Review comment:
       Right, I'm just pointing out that the Arrow CSV reader doesn't appear to handle this (or maybe I haven't found the right set of options).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761305684



##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -289,11 +306,23 @@ class CSVWriterImpl : public ipc::RecordBatchWriter {
       io::OutputStream* sink, std::shared_ptr<io::OutputStream> owned_sink,
       std::shared_ptr<Schema> schema, const WriteOptions& options) {
     RETURN_NOT_OK(options.Validate());
+    // Reject null string values that contain quotes.
+    if (CountEscapes(options.null_string) != 0) {
+      return Status::Invalid("CSV null value string cannot contain quotes.",
+                             options.null_string);

Review comment:
       This will result in a formatted message such as `CSV null value string cannot contain quotes.ab"c`, perhaps arrange it a bit better? Or we can simply omit the actual `null_string` value since it is a user-defined parameter anyway.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761101620



##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2

Review comment:
       Here's with the `csv` module from the Python standard library:
   ```python
   >>> import csv
   >>> lines = ['NA,"NA",""NA"","""NA"""']
   >>> list(csv.reader(lines))
   [['NA', 'NA', 'NA""', '"NA"']]
   ```
   
   As you see, the right quoting-escaping for `"NA"` should be `"""NA"""` (three quotes).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761101620



##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2

Review comment:
       Here's with the `csv` module from the Python standard library:
   ```python
   >>> import csv
   >>> lines = ['NA,"NA",""NA"","""NA"""']
   >>> list(csv.reader(lines))
   [['NA', 'NA', 'NA""', '"NA"']]
   ```
   
   As you see, the right quoting-escaping for `"NA"` shuold be `"""NA"""` (three quotes).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#issuecomment-985940779


   Benchmark runs are scheduled for baseline = eee9df5b4f1631f3b4e6a2edacc98e850e9acaf2 and contender = 7240420a38d4b12317d54fc5195ddcaa5f35cf61. 7240420a38d4b12317d54fc5195ddcaa5f35cf61 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/448de220117e4e4a9140c855c0c427ef...a8ad475a03074138a77ca49eda0ecc9d/)
   [Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/5b7bf75ccd0b4b1c8280f6b1a6de34eb...0e86d127e6254d9592624523b984db81/)
   [Finished :arrow_down:0.31% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/c4efb338d4224d90a2d3173218a5e65f...42290700f4cb48a999354a4a29ba9dbd/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#issuecomment-986163877


   Benchmark runs are scheduled for baseline = eee9df5b4f1631f3b4e6a2edacc98e850e9acaf2 and contender = 7240420a38d4b12317d54fc5195ddcaa5f35cf61. 7240420a38d4b12317d54fc5195ddcaa5f35cf61 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/d26af69075c442ad9d75f9a6417198a2...db0d04dc371747b496c4cfec6b0caa18/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/150e5b036abe49edbeef76e507320bf7...19c63885bfad46128daeb42336fad1f9/)
   [Finished :arrow_down:0.35% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/9af13d1bb1ff43499e3114a07baef591...1c1cd2b6a4124e1d8585e5807de26cf8/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761082181



##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2

Review comment:
       Hmm, this doesn't look exactly right. I think it should be `"""NA"""`, not `""NA""`.

##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2

Review comment:
       Though it seems a bug in the escaping logic, not in this PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] johanpel commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
johanpel commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761097732



##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2

Review comment:
       This test case is for when a user supplies `"NA"` (including actual quotes in the value) as the null value string.
   The idea is that all quotes in the null value string are escaped before rendering it for null values. This way it won't be confused with a rendered valid value string "NA". Hope this makes sense?

##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2

Review comment:
       This test case is for when a user supplies `"NA"` (including actual quotes) as the null value string.
   The idea is that all quotes in the null value string are escaped before rendering it for null values. This way it won't be confused with a rendered valid value string "NA". Hope this makes sense?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761101620



##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2

Review comment:
       Here's with the `csv` module from the Python standard library:
   ```python
   >>> import csv
   >>> lines = ['NA,"NA",""NA"","""NA"""']
   >>> csv.reader(lines)
   <_csv.reader at 0x7fc19c548f90>
   >>> list(csv.reader(lines))
   [['NA', 'NA', 'NA""', '"NA"']]
   ```
   
   As you see, the right quoting-escaping for `"NA"` shuold be `"""NA"""` (three quotes).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#issuecomment-985940779


   Benchmark runs are scheduled for baseline = eee9df5b4f1631f3b4e6a2edacc98e850e9acaf2 and contender = 7240420a38d4b12317d54fc5195ddcaa5f35cf61. 7240420a38d4b12317d54fc5195ddcaa5f35cf61 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/448de220117e4e4a9140c855c0c427ef...a8ad475a03074138a77ca49eda0ecc9d/)
   [Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/5b7bf75ccd0b4b1c8280f6b1a6de34eb...0e86d127e6254d9592624523b984db81/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/c4efb338d4224d90a2d3173218a5e65f...42290700f4cb48a999354a4a29ba9dbd/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#issuecomment-986163877


   Benchmark runs are scheduled for baseline = eee9df5b4f1631f3b4e6a2edacc98e850e9acaf2 and contender = 7240420a38d4b12317d54fc5195ddcaa5f35cf61. 7240420a38d4b12317d54fc5195ddcaa5f35cf61 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/d26af69075c442ad9d75f9a6417198a2...db0d04dc371747b496c4cfec6b0caa18/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/150e5b036abe49edbeef76e507320bf7...19c63885bfad46128daeb42336fad1f9/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/9af13d1bb1ff43499e3114a07baef591...1c1cd2b6a4124e1d8585e5807de26cf8/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#issuecomment-984648242


   Note we could simply disallow quotes in the null string. There's probably no use case for them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761306739



##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,45 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                                  { },
+                                                  {"g": 1337, "h": "\"NA\""}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n" +                 // line 2
+                                   R"(1337,"""NA""")" + "\n";            // line 3
+
   return std::vector<WriterTestParams>{
-      {abc_schema, "[]", DefaultTestOptions(/*header=*/false), ""},
-      {abc_schema, "[]", DefaultTestOptions(/*header=*/true), expected_header},
-      {abc_schema, populated_batch, DefaultTestOptions(/*header=*/false),
+      {abc_schema, "[]", DefaultTestOptions(/*include_header=*/false, /*null_string=*/""),
+       ""},
+      {abc_schema, "[]", DefaultTestOptions(/*include_header=*/true, /*null_string=*/""),
+       expected_header},
+      {abc_schema, populated_batch,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/""),
        expected_without_header},
-      {abc_schema, populated_batch, DefaultTestOptions(/*header=*/true),
-       expected_header + expected_without_header}};
+      {abc_schema, populated_batch,
+       DefaultTestOptions(/*include_header=*/true, /*null_string=*/""),
+       expected_header + expected_without_header},
+      {schema_custom_na, populated_batch_custom_na,
+       DefaultTestOptions(/*include_header=*/false, /*null_string=*/"NA"),
+       expected_custom_na}};
 }

Review comment:
       Also add a test that putting a quote in `null_string` returns an error? You can use e.g. `ASSERT_RAISES` for that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#issuecomment-985719698


   Benchmark runs are scheduled for baseline = eee9df5b4f1631f3b4e6a2edacc98e850e9acaf2 and contender = 7240420a38d4b12317d54fc5195ddcaa5f35cf61. 7240420a38d4b12317d54fc5195ddcaa5f35cf61 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/a65d11d55c684eedbb4eadcedb92d36b...a7cbee1da84346d49f5b532a39ed69d9/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/021fa5e1ffa444ef927aa55054afc11c...81fb94c6a92241e9be00ce32e528ff31/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/cadd0e169a274337a25787cbf7574078...c1022feb65ca49819c851e1fb83af325/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761091386



##########
File path: cpp/src/arrow/csv/writer_test.cc
##########
@@ -69,25 +70,49 @@ std::vector<WriterTestParams> GenerateTestCases() {
                              { "a": 124, "b\"": "a\"\"b\"" },
                              { "d": 0 },
                              { "e": 86400000 },
-                             { "f": 1078016523 }])";
-  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +       // line 1
-                                        R"(1,"abc""efg",2324,,,)" + "\n" +     // line 2
-                                        R"(,"abcd",5467,,,)" + "\n" +          // line 3
-                                        R"(,,,,,)" + "\n" +                    // line 4
-                                        R"(546,"",517,,,)" + "\n" +            // line 5
-                                        R"(124,"a""""b""",,,,)" + "\n" +       // line 6
-                                        R"(,,,1970-01-01,,)" + "\n" +          // line 7
-                                        R"(,,,,1970-01-02,)" + "\n" +          // line 8
-                                        R"(,,,,,2004-02-29 01:02:03)" + "\n";  // line 9
+                             { "f": 1078016523 },
+                             { "b\"": "NA" }])";
+  std::string expected_without_header = std::string("1,,-1,,,") + "\n" +        // line 1
+                                        R"(1,"abc""efg",2324,,,)" + "\n" +      // line 2
+                                        R"(,"abcd",5467,,,)" + "\n" +           // line 3
+                                        R"(,,,,,)" + "\n" +                     // line 4
+                                        R"(546,"",517,,,)" + "\n" +             // line 5
+                                        R"(124,"a""""b""",,,,)" + "\n" +        // line 6
+                                        R"(,,,1970-01-01,,)" + "\n" +           // line 7
+                                        R"(,,,,1970-01-02,)" + "\n" +           // line 8
+                                        R"(,,,,,2004-02-29 01:02:03)" + "\n" +  // line 9
+                                        R"(,"NA",,,,)" + "\n";                  // line 10
+
   std::string expected_header = std::string(R"("a","b""","c ","d","e","f")") + "\n";
 
+  auto schema_custom_na = schema({field("g", uint64()), field("h", utf8())});
+
+  auto populated_batch_custom_na = R"([{"g": 42, "h": "NA"},
+                                        {}])";
+
+  std::string expected_custom_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                   R"(NA,NA)" + "\n";                  // line 2
+
+  std::string expected_custom_quoted_na = std::string(R"(42,"NA")") + "\n" +  // line 1
+                                          R"(""NA"",""NA"")" + "\n";          // line 2

Review comment:
       Actually, no, the problem really seems to be in this PR, since otherwise the CSV writer works fine:
   ```python
   >>> tab = pa.table({'a': [0,1,2], 'b': ['foo', '"bar"', '""baz""']})
   >>> bio = io.BytesIO()
   >>> csv.write_csv(tab, bio)
   >>> bio.getvalue()
   b'"a","b"\n0,"foo"\n1,"""bar"""\n2,"""""baz"""""\n'
   >>> bio.seek(0)
   0
   >>> tt = csv.read_csv(bio)
   >>> tt
   pyarrow.Table
   a: int64
   b: string
   ----
   a: [[0,1,2]]
   b: [["foo",""bar"","""baz"""]]
   >>> tab == tt
   True
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] johanpel commented on pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
johanpel commented on pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#issuecomment-984652511


   > Note we could simply disallow quotes in the null string. There's probably no use case for them.
   
   That does make everything a lot easier.
   I'll update the PR with this suggestion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#discussion_r761222585



##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -140,12 +142,15 @@ char* EscapeReverse(arrow::util::string_view s, char* out_end) {
 // from a cast does not require quoting or escaping.
 class UnquotedColumnPopulator : public ColumnPopulator {
  public:
-  explicit UnquotedColumnPopulator(MemoryPool* memory_pool, char end_char)
-      : ColumnPopulator(memory_pool, end_char) {}
+  explicit UnquotedColumnPopulator(MemoryPool* memory_pool, char end_char,
+                                   std::shared_ptr<Buffer> null_string_)
+      : ColumnPopulator(memory_pool, end_char, std::move(null_string_)) {}
 
   Status UpdateRowLengths(int32_t* row_lengths) override {
     for (int x = 0; x < casted_array_->length(); x++) {
-      row_lengths[x] += casted_array_->value_length(x);
+      row_lengths[x] += casted_array_->value_length(x) == 0

Review comment:
       Hmm. Should this instead check `casted_array_->IsNull(x)`? While I don't think any kernels would generate data like this, `value_length` may not be zero for a null string.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#issuecomment-985719698


   Benchmark runs are scheduled for baseline = eee9df5b4f1631f3b4e6a2edacc98e850e9acaf2 and contender = 7240420a38d4b12317d54fc5195ddcaa5f35cf61. 7240420a38d4b12317d54fc5195ddcaa5f35cf61 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/a65d11d55c684eedbb4eadcedb92d36b...a7cbee1da84346d49f5b532a39ed69d9/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/021fa5e1ffa444ef927aa55054afc11c...81fb94c6a92241e9be00ce32e528ff31/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/cadd0e169a274337a25787cbf7574078...c1022feb65ca49819c851e1fb83af325/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #11836: ARROW-14903: [C++] Enable CSV Writer to control string to be used for missing data

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #11836:
URL: https://github.com/apache/arrow/pull/11836#issuecomment-985940779


   Benchmark runs are scheduled for baseline = eee9df5b4f1631f3b4e6a2edacc98e850e9acaf2 and contender = 7240420a38d4b12317d54fc5195ddcaa5f35cf61. 7240420a38d4b12317d54fc5195ddcaa5f35cf61 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/448de220117e4e4a9140c855c0c427ef...a8ad475a03074138a77ca49eda0ecc9d/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/5b7bf75ccd0b4b1c8280f6b1a6de34eb...0e86d127e6254d9592624523b984db81/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/c4efb338d4224d90a2d3173218a5e65f...42290700f4cb48a999354a4a29ba9dbd/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org