Posted to github@arrow.apache.org by "vibhatha (via GitHub)" <gi...@apache.org> on 2023/06/23 10:48:46 UTC

[GitHub] [arrow] vibhatha opened a new pull request, #36266: GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray

vibhatha opened a new pull request, #36266:
URL: https://github.com/apache/arrow/pull/36266

   ### Rationale for this change
   
   This PR fixes an already filed issue which has been evaluated as a bug.
   
   ### What changes are included in this PR?
   
   This PR includes a minor change to the logic in `scalar_cast_string.cc`.
   
   ### Are these changes tested?
   
   I have added a Python test that reproduces the user's issue and replicates the
   bug; the test is parametrized over several data sizes. I also ran the existing
   C++ tests.
   
   ### Are there any user-facing changes?
   
   Issue filed by a user: https://github.com/apache/arrow/issues/35901
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1399661979


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,39 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+@pytest.mark.parametrize("data_size", (
+    int(1E2),
+    int(1E4),
+    int(1E6)
+))
+def test_large_binary_write_to_csv(tmpdir, data_size):
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.binary(4))
+    fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
+    fixed_table = fixed_table.combine_chunks()

Review Comment:
   Is this line actually necessary for this test?





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1403414855


##########
cpp/src/arrow/compute/kernels/scalar_cast_test.cc:
##########
@@ -2104,6 +2104,11 @@ TEST(Cast, BinaryToString) {
     // ARROW-16757: we no longer zero copy, but the contents are equal
     ASSERT_NE(invalid_utf8->data()->buffers[1].get(), strings->data()->buffers[2].get());
     ASSERT_TRUE(invalid_utf8->data()->buffers[1]->Equals(*strings->data()->buffers[2]));
+
+    // ARROW-35901: check that casting with first buffer being a `nullptr` works
+    auto fixed_array_null = ArrayFromJSON(from_type, "[]");
+    fixed_array_null->data()->buffers[0] = NULLPTR;
+    CheckCast(fixed_array_null, ArrayFromJSON(string_type, "[]"));

Review Comment:
   Did you make sure that this fails without the patch?





[GitHub] [arrow] vibhatha commented on a diff in pull request #36266: GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1255383206


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1968,3 +1968,27 @@ def test_write_csv_decimal(tmpdir, type_factory):
     out = read_csv(tmpdir / "out.csv")
 
     assert out.column('col').cast(type) == table.column('col')
+
+
+@pytest.mark.parametrize("data_size", (
+    int(1E2),
+    int(1E4),
+    int(1E6)
+))
+def test_large_binary_write_to_csv(tmpdir, data_size):
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.binary(4))
+    fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
+
+    csvoption = WriteOptions(include_header=True, batch_size=2048,
+                             delimiter='|', quoting_style='none')
+
+    write_csv(fixed_table, file_name, write_options=csvoption)
+
+    assert os.path.exists(file_name)
+
+    res_table = read_csv(file_name)
+    assert fixed_table.shape, res_table.shape

Review Comment:
   Updated the code. Thanks for catching this one. 





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on PR #36266:
URL: https://github.com/apache/arrow/pull/36266#issuecomment-1818848544

   I wasn't able to get the right indentation though, so I expect at least a linter error 😬 😁 




Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1403418533


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,33 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+def test_large_binary_write_to_csv(tmpdir):
+    data_size = int(1E6)
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.binary(4))
+    fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
+
+    write_options = WriteOptions(include_header=True, batch_size=2048,
+                                 delimiter='|')

Review Comment:
   (If it is, this means you could use a much smaller `data_size` and something like `batch_size = data_size // 4`? Also add a comment mentioning why this is important for this test.)
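   A rough sketch of that sizing (the concrete numbers here are assumptions, not from the PR):
   
   ```python
   from pyarrow.csv import WriteOptions
   
   # Hypothetical sizing: keep data_size small while still forcing the
   # writer to split the table into several batches.
   data_size = 8192
   batch_size = data_size // 4
   write_options = WriteOptions(include_header=True, batch_size=batch_size)
   ```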





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1414032793


##########
cpp/src/arrow/compute/kernels/scalar_cast_test.cc:
##########
@@ -2104,6 +2104,15 @@ TEST(Cast, BinaryToString) {
     // ARROW-16757: we no longer zero copy, but the contents are equal
     ASSERT_NE(invalid_utf8->data()->buffers[1].get(), strings->data()->buffers[2].get());
     ASSERT_TRUE(invalid_utf8->data()->buffers[1]->Equals(*strings->data()->buffers[2]));
+
+    // GH-35901: check that casting a sliced array
+    // when first buffer's data is a `nullptr`
+    auto fixed_array_null = ArrayFromJSON(from_type, "[\"123\", \"245\", \"345\"]");
+    fixed_array_null = fixed_array_null->Slice(1, 1);
+    fixed_array_null->data()->buffers[0] = std::make_shared<Buffer>(nullptr, 0);

Review Comment:
   This seems to produce an invalid array (see CI results). Perhaps you meant `fixed_array_null->data()->buffers[0] = nullptr`?





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1399695380


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,39 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+@pytest.mark.parametrize("data_size", (
+    int(1E2),
+    int(1E4),
+    int(1E6)
+))
+def test_large_binary_write_to_csv(tmpdir, data_size):
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.binary(4))
+    fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
+    fixed_table = fixed_table.combine_chunks()

Review Comment:
   Not necessary anymore! It will be updated in the next push.





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1403417070


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,33 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+def test_large_binary_write_to_csv(tmpdir):
+    data_size = int(1E6)
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"

Review Comment:
   I'm not sure why this is parameterizing the file name, but in any case, it should follow PEP 8 and put spaces around operators.
   
   (Also, you can use an f-string.)
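   For example (assuming the test's `tmpdir` fixture and `data_size` variable):
   
   ```python
   file_name = tmpdir / f"fixedsize_{data_size}.csv"
   ```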





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1398810429


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,38 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+@pytest.mark.parametrize("data_size", (
+    int(1E2),
+    int(1E4),
+    int(1E6)
+))
+def test_large_binary_write_to_csv(tmpdir, data_size):
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.binary(4))
+    fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
+    fixed_table = fixed_table.combine_chunks()
+
+    write_options = WriteOptions(include_header=True, batch_size=2048,
+                                 delimiter='|')
+    write_csv(fixed_table, file_name, write_options=write_options)
+
+    assert os.path.exists(file_name)
+
+    parse_options = ParseOptions(delimiter="|")
+    convert_options = ConvertOptions(column_types={"fixedsize": pa.binary(4)})
+    read_options = ReadOptions(block_size=2048)
+
+    res_table = read_csv(file_name, parse_options=parse_options,
+                         convert_options=convert_options,
+                         read_options=read_options)
+    res_table = res_table.combine_chunks()
+
+    assert res_table.column(0).chunks[0] == fixed_table.column(0).chunks[0]

Review Comment:
   ```suggestion
       if not res_table.equals(fixed_table):
           # Better error output
           assert res_table.to_pydict() == fixed_table.to_pydict()
   ```





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on PR #36266:
URL: https://github.com/apache/arrow/pull/36266#issuecomment-1820439523

   > Please explicitly describe the rationale in the PR description. "This PR fixes an already filed issue which has been evaluated as a bug" is not an appropriate description.
   
   @pitrou I updated the description. Is the updated description helpful?




Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1403415485


##########
cpp/src/arrow/compute/kernels/scalar_cast_test.cc:
##########
@@ -2104,6 +2104,11 @@ TEST(Cast, BinaryToString) {
     // ARROW-16757: we no longer zero copy, but the contents are equal
     ASSERT_NE(invalid_utf8->data()->buffers[1].get(), strings->data()->buffers[2].get());
     ASSERT_TRUE(invalid_utf8->data()->buffers[1]->Equals(*strings->data()->buffers[2]));
+
+    // ARROW-35901: check that casting with first buffer being a `nullptr` works
+    auto fixed_array_null = ArrayFromJSON(from_type, "[]");
+    fixed_array_null->data()->buffers[0] = NULLPTR;

Review Comment:
   Let's make sure that this is a valid array?
   ```suggestion
       fixed_array_null->data()->buffers[0] = NULLPTR;
       ASSERT_OK(fixed_array_null->ValidateFull());
   ```





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36266:
URL: https://github.com/apache/arrow/pull/36266#issuecomment-1821055144

   @vibhatha It is better like this indeed, though ideally there would be a diagnosis of the actual problem.




Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1412468131


##########
cpp/src/arrow/compute/kernels/scalar_cast_string.cc:
##########
@@ -338,10 +338,12 @@ BinaryToBinaryCastExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* ou
   if (input.offset == output->offset) {
     output->buffers[0] = input.GetBuffer(0);
   } else {
-    ARROW_ASSIGN_OR_RAISE(
-        output->buffers[0],
-        arrow::internal::CopyBitmap(ctx->memory_pool(), input.buffers[0].data,
-                                    input.offset, input.length));
+    if (input.buffers[0].data != nullptr) {
+      ARROW_ASSIGN_OR_RAISE(
+          output->buffers[0],
+          arrow::internal::CopyBitmap(ctx->memory_pool(), input.buffers[0].data,
+                                      input.offset, input.length));
+    }

Review Comment:
   Test added!





[GitHub] [arrow] vibhatha commented on a diff in pull request #36266: GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1260559545


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1968,3 +1968,28 @@ def test_write_csv_decimal(tmpdir, type_factory):
     out = read_csv(tmpdir / "out.csv")
 
     assert out.column('col').cast(type) == table.column('col')
+
+
+@pytest.mark.parametrize("data_size", (
+    int(1E2),
+    int(1E4),
+    int(1E6)
+))
+def test_large_binary_write_to_csv(tmpdir, data_size):
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.string())

Review Comment:
   You're correct; I had missed that point. The test cases have been updated.





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1403417657


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,33 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+def test_large_binary_write_to_csv(tmpdir):
+    data_size = int(1E6)
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.binary(4))
+    fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
+
+    write_options = WriteOptions(include_header=True, batch_size=2048,
+                                 delimiter='|')

Review Comment:
   There is nothing wrong with this, but the custom delimiter seems gratuitous. Is the batch size related to the data size?





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1402791954


##########
cpp/src/arrow/compute/kernels/scalar_cast_string.cc:
##########
@@ -338,10 +338,12 @@ BinaryToBinaryCastExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* ou
   if (input.offset == output->offset) {
     output->buffers[0] = input.GetBuffer(0);
   } else {
-    ARROW_ASSIGN_OR_RAISE(
-        output->buffers[0],
-        arrow::internal::CopyBitmap(ctx->memory_pool(), input.buffers[0].data,
-                                    input.offset, input.length));
+    if (input.buffers[0].data != nullptr) {
+      ARROW_ASSIGN_OR_RAISE(
+          output->buffers[0],
+          arrow::internal::CopyBitmap(ctx->memory_pool(), input.buffers[0].data,
+                                      input.offset, input.length));
+    }

Review Comment:
   I attempted to expand the test! I'm still new to Arrow C++; could you take a look at it? 





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1399037832


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,41 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+@pytest.mark.parametrize("data_size", (
+    int(1E2),
+    int(1E4),
+    int(1E6)
+))
+def test_large_binary_write_to_csv(tmpdir, data_size):
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.binary(4))
+    fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
+    fixed_table = fixed_table.combine_chunks()
+
+    write_options = WriteOptions(include_header=True, batch_size=2048,
+                                 delimiter='|')
+    write_csv(fixed_table, file_name, write_options=write_options)
+
+    assert os.path.exists(file_name)
+
+    parse_options = ParseOptions(delimiter="|")
+    convert_options = ConvertOptions(column_types={"fixedsize": pa.binary(4)})
+    read_options = ReadOptions(block_size=2048)
+
+    res_table = read_csv(file_name, parse_options=parse_options,
+                         convert_options=convert_options,
+                         read_options=read_options)
+    res_table = res_table.combine_chunks()
+
+    if not res_table.equals(fixed_table):
+        assert res_table.schema == fixed_table.schema
+        assert res_table.num_rows == fixed_table.num_rows
+        assert res_table.to_pydict() == fixed_table.to_pydict()

Review Comment:
   ```suggestion
       convert_options = ConvertOptions(column_types={"fixedsize": pa.binary(4)},
                                                                   quoted_strings_can_be_null=False)
       read_options = ReadOptions(block_size=2048)
   
       res_table = read_csv(file_name, parse_options=parse_options,
                            convert_options=convert_options,
                            read_options=read_options)
       res_table = res_table.combine_chunks()
   
       assert res_table.equals(fixed_table)
   ```





[GitHub] [arrow] felipecrv commented on pull request #36266: GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #36266:
URL: https://github.com/apache/arrow/pull/36266#issuecomment-1613247633

   Is this still a draft or ready for review?




Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1412669913


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,33 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+def test_large_binary_write_to_csv(tmpdir):
+    data_size = int(1E6)
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.binary(4))
+    fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
+
+    write_options = WriteOptions(include_header=True, batch_size=2048,
+                                 delimiter='|')

Review Comment:
   The `batch_size` doesn't matter much, but I used it for the sake of reproducing the error. The data size, however, is important. I agree with @anjakefala 





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1399660647


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,39 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+@pytest.mark.parametrize("data_size", (
+    int(1E2),
+    int(1E4),
+    int(1E6)

Review Comment:
   Why the parameterization? What is the intent here?





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36266:
URL: https://github.com/apache/arrow/pull/36266#issuecomment-1819677004

   Please explicitly describe the rationale in the PR description. "This PR fixes an already filed issue which has been evaluated as a bug" is not an appropriate description.




Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on PR #36266:
URL: https://github.com/apache/arrow/pull/36266#issuecomment-1818845823

   > I have found the cause of the random flaky failures in CI.
   > 
   > The code in the tests is generating a numpy array with random strings of length 4. They can also include the string `"NULL"`, which currently gets converted to `None`.
   > Reproducible example
   > 
   > We should use `convert_options = ConvertOptions(column_types={"fixedsize": pa.binary(4)}, quoted_strings_can_be_null=False)` so that `"NULL"` is not converted to `None`.
   > 
   > Will add the change via suggestions.
   
   Oh Wow! Awesome catch @AlenkaF :pray: 
   Thanks for working on this one. 




Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1399660192


##########
cpp/src/arrow/compute/kernels/scalar_cast_string.cc:
##########
@@ -338,10 +338,12 @@ BinaryToBinaryCastExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* ou
   if (input.offset == output->offset) {
     output->buffers[0] = input.GetBuffer(0);
   } else {
-    ARROW_ASSIGN_OR_RAISE(
-        output->buffers[0],
-        arrow::internal::CopyBitmap(ctx->memory_pool(), input.buffers[0].data,
-                                    input.offset, input.length));
+    if (input.buffers[0].data != nullptr) {
+      ARROW_ASSIGN_OR_RAISE(
+          output->buffers[0],
+          arrow::internal::CopyBitmap(ctx->memory_pool(), input.buffers[0].data,
+                                      input.offset, input.length));
+    }

Review Comment:
   This seems to fix a bug in the C++ kernels, so, could you add a test in this area as well?
   There are existing tests in https://github.com/apache/arrow/blob/f98a13250d10dba248a2bb85989d6b80265e82d8/cpp/src/arrow/compute/kernels/scalar_cast_test.cc#L2059 which you could expand.





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1399691185


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,39 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+@pytest.mark.parametrize("data_size", (
+    int(1E2),
+    int(1E4),
+    int(1E6)

Review Comment:
   From playing with it, `int(1E6)` is the one that actually reproduces the bug (results in a segfault on main). 
   
   Is there value in also having `int(1E2)` or `int(1E4)` to make sure there wasn't a regression with relatively smaller binary arrays? If there isn't, we can remove those parameters. 





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1400705695


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,39 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+@pytest.mark.parametrize("data_size", (
+    int(1E2),
+    int(1E4),
+    int(1E6)

Review Comment:
   I don't think there is value in that, no. 





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1403416193


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,33 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+def test_large_binary_write_to_csv(tmpdir):
+    data_size = int(1E6)

Review Comment:
   Can you add a comment mentioning the GH issue?
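   Perhaps something like this (a sketch; the exact wording is an assumption):
   
   ```python
   # GH-35901: write_csv used to crash on tables containing a
   # FixedSizeBinaryArray this large.
   data_size = int(1E6)
   ```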





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1398859556


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,41 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+@pytest.mark.parametrize("data_size", (
+    int(1E2),
+    int(1E4),
+    int(1E6)
+))
+def test_large_binary_write_to_csv(tmpdir, data_size):
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.binary(4))
+    fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
+    fixed_table = fixed_table.combine_chunks()
+
+    write_options = WriteOptions(include_header=True, batch_size=2048,
+                                 delimiter='|')
+    write_csv(fixed_table, file_name, write_options=write_options)
+
+    assert os.path.exists(file_name)
+
+    parse_options = ParseOptions(delimiter="|")
+    convert_options = ConvertOptions(column_types={"fixedsize": pa.binary(4)})
+    read_options = ReadOptions(block_size=2048)
+
+    res_table = read_csv(file_name, parse_options=parse_options,
+                         convert_options=convert_options,
+                         read_options=read_options)
+    res_table = res_table.combine_chunks()
+
+    if not res_table.equals(fixed_table):
+        assert assert res_table.schema == fixed_table.schema
+        assert assert res_table.num_rows == fixed_table.num_rows

Review Comment:
   ```suggestion
           assert res_table.schema == fixed_table.schema
           assert res_table.num_rows == fixed_table.num_rows
   ```





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1403414521


##########
cpp/src/arrow/compute/kernels/scalar_cast_test.cc:
##########
@@ -2104,6 +2104,11 @@ TEST(Cast, BinaryToString) {
     // ARROW-16757: we no longer zero copy, but the contents are equal
     ASSERT_NE(invalid_utf8->data()->buffers[1].get(), strings->data()->buffers[2].get());
     ASSERT_TRUE(invalid_utf8->data()->buffers[1]->Equals(*strings->data()->buffers[2]));
+
+    // ARROW-35901: check that casting with first buffer being a `nullptr` works

Review Comment:
   ```suggestion
       // GH-35901: check that casting with first buffer being a `nullptr` works
   ```





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1399691642


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,39 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+@pytest.mark.parametrize("data_size", (
+    int(1E2),
+    int(1E4),
+    int(1E6)
+))
+def test_large_binary_write_to_csv(tmpdir, data_size):
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.binary(4))
+    fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
+    fixed_table = fixed_table.combine_chunks()

Review Comment:
   It was with the original way I was comparing the chunks. But, Alenka changed the comparison method, so let's see if it's still needed.







Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1399952148


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,39 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+@pytest.mark.parametrize("data_size", (
+    int(1E2),
+    int(1E4),
+    int(1E6)
+))
+def test_large_binary_write_to_csv(tmpdir, data_size):
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.binary(4))
+    fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
+    fixed_table = fixed_table.combine_chunks()

Review Comment:
   (It's not. =)





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1412668343


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,33 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+def test_large_binary_write_to_csv(tmpdir):
+    data_size = int(1E6)
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.binary(4))
+    fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
+
+    write_options = WriteOptions(include_header=True, batch_size=2048,
+                                 delimiter='|')

Review Comment:
   Turns out `batch_size` is not important, either!





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1412464754


##########
cpp/src/arrow/compute/kernels/scalar_cast_test.cc:
##########
@@ -2104,6 +2104,11 @@ TEST(Cast, BinaryToString) {
     // ARROW-16757: we no longer zero copy, but the contents are equal
     ASSERT_NE(invalid_utf8->data()->buffers[1].get(), strings->data()->buffers[2].get());
     ASSERT_TRUE(invalid_utf8->data()->buffers[1]->Equals(*strings->data()->buffers[2]));
+
+    // ARROW-35901: check that casting with first buffer being a `nullptr` works
+    auto fixed_array_null = ArrayFromJSON(from_type, "[]");
+    fixed_array_null->data()->buffers[0] = NULLPTR;
+    CheckCast(fixed_array_null, ArrayFromJSON(string_type, "[]"));

Review Comment:
   It now passes with the patch and segfaults on main! The patch performs the nullptr check inside the branch taken when the input and output offsets differ, so only sliced buffers were affected.
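   A minimal Python-level sketch of the condition being described (an illustration mirroring the C++ test, not code from the PR):
   
   ```python
   import pyarrow as pa
   
   arr = pa.array([b"1234", b"5678", b"9abc"], pa.binary(4))
   # No validity bitmap is allocated when there are no nulls.
   assert arr.buffers()[0] is None
   sliced = arr.slice(1, 1)  # non-zero offset
   # Casting to string exercises the kernel patched in this PR.
   print(sliced.cast(pa.string()))
   ```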





[GitHub] [arrow] vibhatha commented on pull request #36266: GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on PR #36266:
URL: https://github.com/apache/arrow/pull/36266#issuecomment-1613306753

   > Is this still a draft or ready for review?
   
   Appreciate a review.




[GitHub] [arrow] westonpace commented on a diff in pull request #36266: GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1254516487


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1968,3 +1968,27 @@ def test_write_csv_decimal(tmpdir, type_factory):
     out = read_csv(tmpdir / "out.csv")
 
     assert out.column('col').cast(type) == table.column('col')
+
+
+@pytest.mark.parametrize("data_size", (
+    int(1E2),
+    int(1E4),
+    int(1E6)
+))
+def test_large_binary_write_to_csv(tmpdir, data_size):
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.binary(4))
+    fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
+
+    csvoption = WriteOptions(include_header=True, batch_size=2048,
+                             delimiter='|', quoting_style='none')
+
+    write_csv(fixed_table, file_name, write_options=csvoption)
+
+    assert os.path.exists(file_name)
+
+    res_table = read_csv(file_name)
+    assert fixed_table.shape, res_table.shape

Review Comment:
   ```suggestion
       assert fixed_table.shape == res_table.shape
   ```
   
   Also, why not compare the tables themselves and not just the shape?
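   For example (a sketch; `Table.equals` compares both schema and data):
   
   ```python
   assert res_table == fixed_table  # Table.__eq__ delegates to Table.equals
   ```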





[GitHub] [arrow] westonpace commented on a diff in pull request #36266: GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1259944135


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1968,3 +1968,28 @@ def test_write_csv_decimal(tmpdir, type_factory):
     out = read_csv(tmpdir / "out.csv")
 
     assert out.column('col').cast(type) == table.column('col')
+
+
+@pytest.mark.parametrize("data_size", (
+    int(1E2),
+    int(1E4),
+    int(1E6)
+))
+def test_large_binary_write_to_csv(tmpdir, data_size):
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.string())

Review Comment:
   ```suggestion
       fixed_arr = pa.array(nparr, pa.binary(4))
   ```
   This test, as written, doesn't actually reproduce the error. Changing this to fixed-size binary does trigger the error (assuming you compile without your fix). You will need to update the comparison, though, because fixed-size binary will be read back as string without any other options. So you will also need to specify the column type (using `convert_options`) when reading from CSV. 
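   A sketch of that read-back (reusing `file_name` and the `fixedsize` column name from the test; the details are assumptions):
   
   ```python
   import pyarrow as pa
   from pyarrow.csv import read_csv, ConvertOptions
   
   convert_options = ConvertOptions(column_types={"fixedsize": pa.binary(4)})
   res_table = read_csv(file_name, convert_options=convert_options)
   assert res_table.column("fixedsize").type == pa.binary(4)
   ```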





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1398848067


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,40 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+@pytest.mark.parametrize("data_size", (
+    int(1E2),
+    int(1E4),
+    int(1E6)
+))
+def test_large_binary_write_to_csv(tmpdir, data_size):
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.binary(4))
+    fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
+    fixed_table = fixed_table.combine_chunks()
+
+    write_options = WriteOptions(include_header=True, batch_size=2048,
+                                 delimiter='|')
+    write_csv(fixed_table, file_name, write_options=write_options)
+
+    assert os.path.exists(file_name)
+
+    parse_options = ParseOptions(delimiter="|")
+    convert_options = ConvertOptions(column_types={"fixedsize": pa.binary(4)})
+    read_options = ReadOptions(block_size=2048)
+
+    res_table = read_csv(file_name, parse_options=parse_options,
+                         convert_options=convert_options,
+                         read_options=read_options)
+    res_table = res_table.combine_chunks()
+
+    if not res_table.equals(fixed_table):
+        # Better error output
+        assert res_table.to_pydict() == fixed_table.to_pydict()

Review Comment:
   ```suggestion
       if not res_table.equals(fixed_table):
           assert assert res_table.schema == fixed_table.schema
           assert assert res_table.num_rows == fixed_table.num_rows
           assert res_table.to_pydict() == fixed_table.to_pydict()
   ```





Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on PR #36266:
URL: https://github.com/apache/arrow/pull/36266#issuecomment-1818837257

   I have found the cause of the random flaky failures in CI.
   
   The code in the tests is generating a numpy array with random strings of length 4. They can also include the string `"NULL"`, which currently gets converted to `None`.
   
   Reproducible example:
   
   ```python
   import pyarrow as pa
   import numpy as np
   import os
   from pyarrow.csv import (read_csv, write_csv,
                            ConvertOptions, ParseOptions,
                            ReadOptions, WriteOptions)
   
   file_name = "fixedsize_NULL"
   nparr = np.array([b'NULL'])
   fixed_arr = pa.array(nparr, pa.binary(4))
   fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
   
   write_options = WriteOptions(include_header=True, batch_size=2048,
                                   delimiter='|')
   write_csv(fixed_table, file_name, write_options=write_options)
   
   assert os.path.exists(file_name)
   
   parse_options = ParseOptions(delimiter="|")
   convert_options = ConvertOptions(column_types={"fixedsize": pa.binary(4)})
   read_options = ReadOptions(block_size=2048)
   
   res_table = read_csv(file_name, parse_options=parse_options,
                           convert_options=convert_options,
                           read_options=read_options)
   ```
   
   ```python
   >>> nparr
   array([b'NULL'], dtype='|S4')
   >>> fixed_table
   pyarrow.Table
   fixedsize: fixed_size_binary[4]
   ----
   fixedsize: [[4E554C4C]]
   >>> res_table
   pyarrow.Table
   fixedsize: fixed_size_binary[4]
   ----
   fixedsize: [[null]]
   ```
   
   We should use `convert_options = ConvertOptions(column_types={"fixedsize": pa.binary(4)}, quoted_strings_can_be_null=False)` so that `"NULL"` is not converted to `None`.
   
   Will add the change via suggestions.




Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on PR #36266:
URL: https://github.com/apache/arrow/pull/36266#issuecomment-1818617178

   Hm, now the Python 3.9 build is passing and 3.8 is failing. I am creating a 3.9 env locally to see what I can find; I hope to stop the suggested-change commit spam =)




Re: [PR] GH-35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray [arrow]

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on code in PR #36266:
URL: https://github.com/apache/arrow/pull/36266#discussion_r1412662188


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -1972,6 +1972,33 @@ def test_write_csv_decimal(tmpdir, type_factory):
     assert out.column('col').cast(type) == table.column('col')
 
 
+def test_large_binary_write_to_csv(tmpdir):
+    data_size = int(1E6)
+    file_name = tmpdir / "fixedsize_"+str(data_size)+".csv"
+
+    nparr = np.frombuffer(np.random.randint(65, 91, data_size, 'u1'), 'S4')
+
+    fixed_arr = pa.array(nparr, pa.binary(4))
+    fixed_table = pa.Table.from_arrays([fixed_arr], names=['fixedsize'])
+
+    write_options = WriteOptions(include_header=True, batch_size=2048,
+                                 delimiter='|')

Review Comment:
   I'm pretty sure both `batch_size` and `delimiter` were picked up from the code that originally caught the bug: https://github.com/apache/arrow/issues/35901.
   
   The `data_size` seems important for triggering the case (smaller values of `data_size` weren't triggering it). I'll play with `batch_size` and remove the delimiter.


