
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/07 21:29:50 UTC

[GitHub] [arrow] wjones127 opened a new pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

wjones127 opened a new pull request #11892:
URL: https://github.com/apache/arrow/pull/11892


   `arrow::csv::StreamingReader` already handles byte order marks (BOMs). However, #7896 introduced `arrow::dataset::GetColumnNames`, which is called before the reader is instantiated and lacked BOM handling. This PR adds BOM handling to that function.
   
   Without BOM handling, the first column name parsed by `arrow::dataset::GetColumnNames` contained the BOM (e.g. it was `"<BOM>a"` instead of `"a"`). Because of this, it failed the check on line 120 below and was not added to `convert_options.include_columns`.
   
   https://github.com/apache/arrow/blob/9cf4275a19c994879172e5d3b03ade9a96a10721/cpp/src/arrow/dataset/file_csv.cc#L117-L122
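
To make the failure mode above concrete, here is a minimal standalone sketch of stripping a UTF-8 BOM from the first header cell before turning it into a column name. It does not use Arrow: `SkipUtf8BomSketch`, `ColumnNameFromCell`, and `DemoColumnNames` are hypothetical names introduced only for illustration, and the sketch makes no claim about how Arrow's own `util::SkipUTF8BOM` is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a BOM-skipping helper; NOT Arrow's implementation.
// Returns a pointer just past a leading UTF-8 BOM (EF BB BF), or `data`
// unchanged when the buffer does not start with one.
inline const uint8_t* SkipUtf8BomSketch(const uint8_t* data, std::size_t size) {
  static const uint8_t kBom[] = {0xEF, 0xBB, 0xBF};
  if (size >= sizeof(kBom) && data[0] == kBom[0] && data[1] == kBom[1] &&
      data[2] == kBom[2]) {
    return data + sizeof(kBom);
  }
  return data;
}

// Hypothetical helper: builds a column name from a raw header cell,
// dropping any leading BOM so that "<BOM>a" and "a" compare equal.
inline std::string ColumnNameFromCell(const uint8_t* data, uint32_t size) {
  const uint8_t* data_no_bom = SkipUtf8BomSketch(data, size);
  // The helper only moves forward (by 0 or 3 bytes), so the difference is
  // never negative and always fits in uint32_t.
  size -= static_cast<uint32_t>(data_no_bom - data);
  return std::string(reinterpret_cast<const char*>(data_no_bom), size);
}

// Small self-check: a BOM-prefixed cell and a plain cell yield the same name.
inline void DemoColumnNames() {
  const uint8_t with_bom[] = {0xEF, 0xBB, 0xBF, 'a'};
  const uint8_t plain[] = {'a'};
  assert(ColumnNameFromCell(with_bom, 4) == "a");
  assert(ColumnNameFromCell(plain, 1) == "a");
}
```

Without the skip, the first name would carry the three BOM bytes and would not match the `"a"` that the reader later reports, which is the mismatch described above.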
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a change in pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #11892:
URL: https://github.com/apache/arrow/pull/11892#discussion_r764762662



##########
File path: r/tests/testthat/test-csv.R
##########
@@ -447,5 +447,13 @@ test_that("write_csv_arrow deals with duplication in include_headers/col_names",
   )
   expect_true(file.exists(csv_file))
   expect_identical(tbl_no_dates, written_tbl)
+})
+
+test_that("read_csv_arrow() deals with BOMs (bite-order-marks) correctly", {

Review comment:
       "byte"

##########
File path: r/tests/testthat/test-dataset-csv.R
##########
@@ -288,3 +289,12 @@ test_that("Column names inferred from schema for headerless CSVs (ARROW-14063)",
   ds <- open_dataset(headerless_csv_dir, format = "csv", schema = schema(int = int32(), dbl = float64()))
   expect_equal(ds %>% collect(), tbl)
 })
+
+test_that("open_dataset() deals with BOMs (bite-order-marks) correctly", {

Review comment:
       "byte"

##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -85,7 +86,13 @@ Result<std::unordered_set<std::string>> GetColumnNames(
 
   RETURN_NOT_OK(
       parser.VisitLastRow([&](const uint8_t* data, uint32_t size, bool quoted) -> Status {
-        util::string_view view{reinterpret_cast<const char*>(data), size};
+        // Skip BOM when reading column names (ARROW-14644)
+        ARROW_ASSIGN_OR_RAISE(auto data_no_bom, util::SkipUTF8BOM(data, size));
+        ptrdiff_t offset = data_no_bom - data;
+        DCHECK_GE(offset, 0);

Review comment:
       This doesn't seem useful. You can therefore shorten this to:
   ```c++
           size = size - static_cast<uint32_t>(data_no_bom - data);
   ```





[GitHub] [arrow] westonpace commented on a change in pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

westonpace commented on a change in pull request #11892:
URL: https://github.com/apache/arrow/pull/11892#discussion_r764388441



##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -85,7 +86,13 @@ Result<std::unordered_set<std::string>> GetColumnNames(
 
   RETURN_NOT_OK(
       parser.VisitLastRow([&](const uint8_t* data, uint32_t size, bool quoted) -> Status {
-        util::string_view view{reinterpret_cast<const char*>(data), size};
+        // Skip BOM when reading column names (ARROW-14644)
+        ARROW_ASSIGN_OR_RAISE(auto data2, util::SkipUTF8BOM(data, size));

Review comment:
       Minor nit: I don't love `data2` as a name.  Maybe `data_no_bom` or `sliced_data`.





[GitHub] [arrow] jonkeane commented on a change in pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

jonkeane commented on a change in pull request #11892:
URL: https://github.com/apache/arrow/pull/11892#discussion_r765138850



##########
File path: r/tests/testthat/test-dataset-csv.R
##########
@@ -288,3 +289,12 @@ test_that("Column names inferred from schema for headerless CSVs (ARROW-14063)",
   ds <- open_dataset(headerless_csv_dir, format = "csv", schema = schema(int = int32(), dbl = float64()))
   expect_equal(ds %>% collect(), tbl)
 })
+
+test_that("open_dataset() deals with BOMs (byte-order-marks) correctly", {
+  writeLines("\xef\xbb\xbfa,b\n1,2\n", con = csv_file)
+
+  expect_equal(
+    open_dataset(csv_file, format = "csv") %>% collect(),
+    tibble(a = 1, b = 2)
+  )

Review comment:
       Would it be possible to make two csv files and read those in as a dataset? This _should_ work just the same, but it would be good to confirm that the second csv has the proper BOM handling as well.





[GitHub] [arrow] wjones127 commented on a change in pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

wjones127 commented on a change in pull request #11892:
URL: https://github.com/apache/arrow/pull/11892#discussion_r765144252



##########
File path: r/tests/testthat/test-dataset-csv.R
##########
@@ -288,3 +289,12 @@ test_that("Column names inferred from schema for headerless CSVs (ARROW-14063)",
   ds <- open_dataset(headerless_csv_dir, format = "csv", schema = schema(int = int32(), dbl = float64()))
   expect_equal(ds %>% collect(), tbl)
 })
+
+test_that("open_dataset() deals with BOMs (byte-order-marks) correctly", {
+  writeLines("\xef\xbb\xbfa,b\n1,2\n", con = csv_file)
+
+  expect_equal(
+    open_dataset(csv_file, format = "csv") %>% collect(),
+    tibble(a = 1, b = 2)
+  )

Review comment:
       Oh that's a good idea. I'll add that really quick.





[GitHub] [arrow] ursabot edited a comment on pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #11892:
URL: https://github.com/apache/arrow/pull/11892#issuecomment-989185185


   Benchmark runs are scheduled for baseline = 001f47eb05f722d8e34b123e6673eeb8be836965 and contender = 62db4b6a2545da29279ee5c138b5f531067d802a. 62db4b6a2545da29279ee5c138b5f531067d802a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7d4611d9b3d14d94a37da1fce8494cd8...10e7ff5e4dae441ea25cff4add4096d3/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/8ef159c8011c41c8ade5b1c7be946449...0be9d7e3c3d14e16ad47f942b5a5941a/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/8481284679d04731ab18e4a95360c1b0...aac2c1901c974a06a274b4c165ce99fa/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   



[GitHub] [arrow] westonpace commented on a change in pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

westonpace commented on a change in pull request #11892:
URL: https://github.com/apache/arrow/pull/11892#discussion_r764459729



##########
File path: r/tests/testthat/test-csv.R
##########
@@ -447,5 +447,13 @@ test_that("write_csv_arrow deals with duplication in include_headers/col_names",
   )
   expect_true(file.exists(csv_file))
   expect_identical(tbl_no_dates, written_tbl)
+})
+
+test_that("read_csv_arrow() deals with BOMs (bite-order-marks) correctly", {
+  writeLines('\xef\xbb\xbfa,b\n1,2\n', con = csv_file)

Review comment:
       Sorry, I should have noticed this before, but it looks like there are some R style violations:
   ```suggestion
     writeLines("\xef\xbb\xbfa,b\n1,2\n", con = csv_file)
   ```





[GitHub] [arrow] ursabot edited a comment on pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #11892:
URL: https://github.com/apache/arrow/pull/11892#issuecomment-989185185


   Benchmark runs are scheduled for baseline = 001f47eb05f722d8e34b123e6673eeb8be836965 and contender = 62db4b6a2545da29279ee5c138b5f531067d802a. 62db4b6a2545da29279ee5c138b5f531067d802a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7d4611d9b3d14d94a37da1fce8494cd8...10e7ff5e4dae441ea25cff4add4096d3/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/8ef159c8011c41c8ade5b1c7be946449...0be9d7e3c3d14e16ad47f942b5a5941a/)
   [Finished :arrow_down:0.31% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/8481284679d04731ab18e4a95360c1b0...aac2c1901c974a06a274b4c165ce99fa/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   



[GitHub] [arrow] dragosmg commented on a change in pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

dragosmg commented on a change in pull request #11892:
URL: https://github.com/apache/arrow/pull/11892#discussion_r765803093



##########
File path: r/tests/testthat/test-dataset-csv.R
##########
@@ -288,3 +289,12 @@ test_that("Column names inferred from schema for headerless CSVs (ARROW-14063)",
   ds <- open_dataset(headerless_csv_dir, format = "csv", schema = schema(int = int32(), dbl = float64()))
   expect_equal(ds %>% collect(), tbl)
 })
+
+test_that("open_dataset() deals with BOMs (byte-order-marks) correctly", {
+  writeLines("\xef\xbb\xbfa,b\n1,2\n", con = csv_file)
+
+  expect_equal(
+    open_dataset(csv_file, format = "csv") %>% collect(),
+    tibble(a = 1, b = 2)
+  )

Review comment:
       @wjones127 @jonkeane I think the additional tests are now failing. I got [this](https://github.com/apache/arrow/runs/4470780115?check_suite_focus=true) CI failure in a different PR. I haven't yet rebased so I don't even have those tests in my branch.





[GitHub] [arrow] ursabot edited a comment on pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #11892:
URL: https://github.com/apache/arrow/pull/11892#issuecomment-989185185


   Benchmark runs are scheduled for baseline = 001f47eb05f722d8e34b123e6673eeb8be836965 and contender = 62db4b6a2545da29279ee5c138b5f531067d802a. 62db4b6a2545da29279ee5c138b5f531067d802a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7d4611d9b3d14d94a37da1fce8494cd8...10e7ff5e4dae441ea25cff4add4096d3/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/8ef159c8011c41c8ade5b1c7be946449...0be9d7e3c3d14e16ad47f942b5a5941a/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/8481284679d04731ab18e4a95360c1b0...aac2c1901c974a06a274b4c165ce99fa/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   



[GitHub] [arrow] jonkeane closed pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

jonkeane closed pull request #11892:
URL: https://github.com/apache/arrow/pull/11892


   



[GitHub] [arrow] github-actions[bot] commented on pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #11892:
URL: https://github.com/apache/arrow/pull/11892#issuecomment-988276105







[GitHub] [arrow] westonpace commented on a change in pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

westonpace commented on a change in pull request #11892:
URL: https://github.com/apache/arrow/pull/11892#discussion_r764459811



##########
File path: r/tests/testthat/test-dataset-csv.R
##########
@@ -288,3 +289,12 @@ test_that("Column names inferred from schema for headerless CSVs (ARROW-14063)",
   ds <- open_dataset(headerless_csv_dir, format = "csv", schema = schema(int = int32(), dbl = float64()))
   expect_equal(ds %>% collect(), tbl)
 })
+
+test_that("open_dataset() deals with BOMs (bite-order-marks) correctly", {
+  writeLines('\xef\xbb\xbfa,b\n1,2\n', con = csv_file)

Review comment:
       ```suggestion
     writeLines("\xef\xbb\xbfa,b\n1,2\n", con = csv_file)
   ```





[GitHub] [arrow] westonpace commented on a change in pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

westonpace commented on a change in pull request #11892:
URL: https://github.com/apache/arrow/pull/11892#discussion_r764437960



##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -85,7 +86,13 @@ Result<std::unordered_set<std::string>> GetColumnNames(
 
   RETURN_NOT_OK(
       parser.VisitLastRow([&](const uint8_t* data, uint32_t size, bool quoted) -> Status {
-        util::string_view view{reinterpret_cast<const char*>(data), size};
+        // Skip BOM when reading column names (ARROW-14644)
+        ARROW_ASSIGN_OR_RAISE(auto data_no_bom, util::SkipUTF8BOM(data, size));
+        int32_t offset = data_no_bom - data;
+        DCHECK_GE(offset, 0);
+        size = size - offset;

Review comment:
       ```suggestion
           ptrdiff_t offset = data_no_bom - data;
           DCHECK_GE(offset, 0);
           size = size - static_cast<uint32_t>(offset);
   ```
   The Windows CI job is failing.
   
   I think this will make Windows happy, but I never know, and I don't have this in my IDE at the moment, so I may be way off base here.
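
As a rough illustration of the suggestion above, the sketch below shows why the cast matters: subtracting two pointers yields `std::ptrdiff_t`, which is 64 bits wide on typical 64-bit targets, so assigning it to `int32_t` is an implicit narrowing that a compiler may flag (presumably what the Windows build objected to, though the thread does not show the exact message). `RemainingSize` is a hypothetical helper written for this example and is not part of Arrow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Pointer subtraction has type std::ptrdiff_t, not int32_t.
static_assert(std::is_same<decltype(static_cast<const uint8_t*>(nullptr) -
                                    static_cast<const uint8_t*>(nullptr)),
                           std::ptrdiff_t>::value,
              "pointer difference has type std::ptrdiff_t");

// Hypothetical helper: shrinks `size` by the number of bytes skipped between
// `data` and `data_no_bom`, with the narrowing conversion spelled out.
inline uint32_t RemainingSize(const uint8_t* data, const uint8_t* data_no_bom,
                              uint32_t size) {
  std::ptrdiff_t offset = data_no_bom - data;
  // Plays the role of DCHECK_GE(offset, 0) in the diff under review.
  assert(offset >= 0);
  // Guard against skipping past the end of the buffer before narrowing.
  assert(static_cast<uint64_t>(offset) <= size);
  return size - static_cast<uint32_t>(offset);
}
```

Keeping the intermediate value as `std::ptrdiff_t` and casting once, at the point where it is combined with the `uint32_t` size, makes the conversion explicit instead of leaving it to an implicit narrowing.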





[GitHub] [arrow] pitrou commented on a change in pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #11892:
URL: https://github.com/apache/arrow/pull/11892#discussion_r765810883



##########
File path: r/tests/testthat/test-dataset-csv.R
##########
@@ -288,3 +289,12 @@ test_that("Column names inferred from schema for headerless CSVs (ARROW-14063)",
   ds <- open_dataset(headerless_csv_dir, format = "csv", schema = schema(int = int32(), dbl = float64()))
   expect_equal(ds %>% collect(), tbl)
 })
+
+test_that("open_dataset() deals with BOMs (byte-order-marks) correctly", {
+  writeLines("\xef\xbb\xbfa,b\n1,2\n", con = csv_file)
+
+  expect_equal(
+    open_dataset(csv_file, format = "csv") %>% collect(),
+    tibble(a = 1, b = 2)
+  )

Review comment:
       @dragosmg I filed https://issues.apache.org/jira/browse/ARROW-15041





[GitHub] [arrow] ursabot commented on pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

ursabot commented on pull request #11892:
URL: https://github.com/apache/arrow/pull/11892#issuecomment-989185185


   Benchmark runs are scheduled for baseline = 001f47eb05f722d8e34b123e6673eeb8be836965 and contender = 62db4b6a2545da29279ee5c138b5f531067d802a. 62db4b6a2545da29279ee5c138b5f531067d802a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7d4611d9b3d14d94a37da1fce8494cd8...10e7ff5e4dae441ea25cff4add4096d3/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/8ef159c8011c41c8ade5b1c7be946449...0be9d7e3c3d14e16ad47f942b5a5941a/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/8481284679d04731ab18e4a95360c1b0...aac2c1901c974a06a274b4c165ce99fa/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   



[GitHub] [arrow] wjones127 commented on a change in pull request #11892: ARROW-14644: [C++][R] open_dataset doesn't ignore BOM in csv file

Posted by GitBox <gi...@apache.org>.

wjones127 commented on a change in pull request #11892:
URL: https://github.com/apache/arrow/pull/11892#discussion_r764441816



##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -85,7 +86,13 @@ Result<std::unordered_set<std::string>> GetColumnNames(
 
   RETURN_NOT_OK(
       parser.VisitLastRow([&](const uint8_t* data, uint32_t size, bool quoted) -> Status {
-        util::string_view view{reinterpret_cast<const char*>(data), size};
+        // Skip BOM when reading column names (ARROW-14644)
+        ARROW_ASSIGN_OR_RAISE(auto data_no_bom, util::SkipUTF8BOM(data, size));
+        int32_t offset = data_no_bom - data;
+        DCHECK_GE(offset, 0);
+        size = size - offset;

Review comment:
       Thanks, I wasn't sure what the best way to cast explicitly was. Those changes work locally on my Mac; hopefully they will also work in the Windows builds.



