Posted to issues@arrow.apache.org by "zitan-guo (via GitHub)" <gi...@apache.org> on 2023/06/04 13:00:16 UTC

[GitHub] [arrow] zitan-guo opened a new issue, #35901: [Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray

zitan-guo opened a new issue, #35901:
URL: https://github.com/apache/arrow/issues/35901

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   
   Suppose we have a large array of binary strings:
   ```
   import numpy as np
   import pyarrow as pa
   import pyarrow.csv as pcsv
   
   nparr = np.frombuffer(np.random.randint(65, 91, int(4E8), 'u1'), 'S4')
   ```
   We want to construct a FixedSizeBinaryArray and a Table:
   ```
   fixedarr = pa.array(nparr, pa.binary(4))
   fixedtable = pa.Table.from_arrays([fixedarr], names=['fixedsize'])
   ```
   
   Alternatively, construct a BinaryArray and a Table:
   ```
   binarr = pa.array(nparr, pa.binary())
   bintable = pa.Table.from_arrays([binarr], names=['binary'])
   ```
   Set up the CSV write options:
   ```
   csvoption = pcsv.WriteOptions(include_header=False, batch_size=2048, delimiter='|', quoting_style='none')
   ```
   Writing the table containing the BinaryArray works:
   ```
   pcsv.write_csv(bintable, 'binary.csv', write_options=csvoption)
   ```
   
   But writing the table containing the FixedSizeBinaryArray crashes after the first `batch_size` rows:
   ```
   pcsv.write_csv(fixedtable, 'fixedsize.csv', write_options=csvoption)
   ```
   Note that both tables can be written with `pyarrow.parquet.write_table`, though.
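   For reference, a possible workaround sketch (untested against the exact environment above; the `Table.cast` approach is an assumption based on the public pyarrow API): casting the column to variable-width binary before writing appears to avoid the failing path, since the BinaryArray table writes fine above.
   ```
   import pyarrow as pa
   import pyarrow.csv as pcsv

   # Build a small fixed-size binary table like the one above.
   fixedarr = pa.array([b'ABCD', b'EFGH'], pa.binary(4))
   fixedtable = pa.Table.from_arrays([fixedarr], names=['fixedsize'])

   # Cast the column to variable-width binary up front; after this, no
   # FixedSizeBinaryArray ever reaches the CSV writer's per-batch cast.
   casted = fixedtable.cast(pa.schema([('fixedsize', pa.binary())]))

   # Write to an in-memory sink instead of a file for this sketch.
   sink = pa.BufferOutputStream()
   pcsv.write_csv(casted, sink,
                  write_options=pcsv.WriteOptions(include_header=False))
   ```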
   
   
   **Environment**: 
   Windows 10
   pyarrow 12.0.0
   Python 3.10.10
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] vibhatha commented on issue #35901: [C++][Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray

Posted by "vibhatha (via GitHub)" <gi...@apache.org>.
vibhatha commented on issue #35901:
URL: https://github.com/apache/arrow/issues/35901#issuecomment-1589372459

   take




[GitHub] [arrow] westonpace commented on issue #35901: [Python] pyarrow.csv.write_csv crashes when writing tables containing FixedSizeBinaryArray

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35901:
URL: https://github.com/apache/arrow/issues/35901#issuecomment-1585291160

   This appears to be a bug in the cast function that goes from FixedSizeBinaryArray to StringArray:
   
   In `BinaryToBinaryCastExec` in `scalar_cast_string.cc` we have:
   
   ```
     if (input.offset == output->offset) {
       output->buffers[0] = input.GetBuffer(0);
     } else {
       ARROW_ASSIGN_OR_RAISE(
           output->buffers[0],
           arrow::internal::CopyBitmap(ctx->memory_pool(), input.buffers[0].data,
                                       input.offset, input.length));
     }
   ```
   
   Unfortunately, `input.buffers[0].data` might be `nullptr` (since we can omit the validity bitmap if everything is valid) and that causes `CopyBitmap` to fail.
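   The precondition is easy to see from Python (a small hypothetical illustration of the array state, not of the crash itself): an array with no nulls carries no validity bitmap at all, and a slice keeps that absent bitmap while gaining a nonzero offset, which is exactly what sends a null `data` pointer into `CopyBitmap`.
   ```
   import pyarrow as pa

   # A fully valid array omits its validity bitmap entirely.
   arr = pa.array([b'ABCD', b'EFGH', b'IJKL'], pa.binary(4))
   print(arr.buffers()[0])    # None: no validity bitmap allocated

   # Slicing yields a view with a nonzero offset and the same
   # (still absent) validity bitmap.
   sliced = arr.slice(1)
   print(sliced.offset, sliced.buffers()[0])    # 1 None
   ```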
   
   A full stack trace is:
   
   ```
   #0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:317
   #1  0x00007fff711f26e7 in arrow::internal::TransferBitmap<(arrow::internal::TransferMode)0> (data=0x100 <error: Cannot access memory at address 0x100>, offset=2048, length=2048, dest_offset=0, 
       dest=0x5555566193c0 "") at /home/pace/dev/arrow/cpp/src/arrow/util/bitmap_ops.cc:163
   #2  0x00007fff711f2c0f in arrow::internal::TransferBitmap<(arrow::internal::TransferMode)0> (pool=0x7fff73311508 <arrow::global_state+8>, data=0x0, offset=2048, length=2048)
       at /home/pace/dev/arrow/cpp/src/arrow/util/bitmap_ops.cc:222
   #3  0x00007fff711ee979 in arrow::internal::CopyBitmap (pool=0x7fff73311508 <arrow::global_state+8>, data=0x0, offset=2048, length=2048) at /home/pace/dev/arrow/cpp/src/arrow/util/bitmap_ops.cc:252
   #4  0x00007fff7165b1c2 in arrow::compute::internal::(anonymous namespace)::BinaryToBinaryCastExec<arrow::StringType, arrow::FixedSizeBinaryType> (ctx=0x555556670288, batch=..., out=0x7fffffffc710)
       at /home/pace/dev/arrow/cpp/src/arrow/compute/kernels/scalar_cast_string.cc:341
   #5  0x00007fff715369a1 in arrow::compute::detail::(anonymous namespace)::ScalarExecutor::ExecuteNonSpans (this=0x5555564a37d0, listener=0x7fffffffc930) at /home/pace/dev/arrow/cpp/src/arrow/compute/exec.cc:920
   #6  0x00007fff7153551c in arrow::compute::detail::(anonymous namespace)::ScalarExecutor::Execute (this=0x5555564a37d0, batch=..., listener=0x7fffffffc930)
       at /home/pace/dev/arrow/cpp/src/arrow/compute/exec.cc:810
   #7  0x00007fff7157e88b in arrow::compute::detail::FunctionExecutorImpl::Execute (this=0x555556670260, args=std::vector of length 1, capacity 1 = {...}, passed_length=-1)
       at /home/pace/dev/arrow/cpp/src/arrow/compute/function.cc:276
   #8  0x00007fff7157be03 in arrow::compute::(anonymous namespace)::ExecuteInternal (func=..., args=std::vector of length 1, capacity 1 = {...}, passed_length=-1, options=0x7fffffffcea0, ctx=0x7fffffffd020)
       at /home/pace/dev/arrow/cpp/src/arrow/compute/function.cc:341
   #9  0x00007fff7157bf9e in arrow::compute::Function::Execute (this=0x5555568088e0, args=std::vector of length 1, capacity 1 = {...}, options=0x7fffffffcea0, ctx=0x7fffffffd020)
       at /home/pace/dev/arrow/cpp/src/arrow/compute/function.cc:348
   #10 0x00007fff715229b7 in arrow::compute::internal::(anonymous namespace)::CastMetaFunction::ExecuteImpl (this=0x5555562ae610, args=std::vector of length 1, capacity 1 = {...}, options=0x7fffffffcea0, 
       ctx=0x7fffffffd020) at /home/pace/dev/arrow/cpp/src/arrow/compute/cast.cc:124
   #11 0x00007fff7157d37e in arrow::compute::MetaFunction::Execute (this=0x5555562ae610, args=std::vector of length 1, capacity 1 = {...}, options=0x7fffffffcea0, ctx=0x7fffffffd020)
       at /home/pace/dev/arrow/cpp/src/arrow/compute/function.cc:481
   #12 0x00007fff7153a6ba in arrow::compute::CallFunction (func_name="cast", args=std::vector of length 1, capacity 1 = {...}, options=0x7fffffffcea0, ctx=0x7fffffffd020)
       at /home/pace/dev/arrow/cpp/src/arrow/compute/exec.cc:1369
   #13 0x00007fff71523bbc in arrow::compute::Cast (value=..., options=..., ctx=0x7fffffffd020) at /home/pace/dev/arrow/cpp/src/arrow/compute/cast.cc:229
   #14 0x00007fff71523d32 in arrow::compute::Cast (value=..., to_type=..., options=..., ctx=0x7fffffffd020) at /home/pace/dev/arrow/cpp/src/arrow/compute/cast.cc:236
   #15 0x00007fff71523dfe in arrow::compute::Cast (value=..., to_type=..., options=..., ctx=0x7fffffffd020) at /home/pace/dev/arrow/cpp/src/arrow/compute/cast.cc:241
   #16 0x00007fff71445a0d in arrow::csv::(anonymous namespace)::ColumnPopulator::UpdateRowLengths (this=0x55555673e540, data=..., row_lengths=0x555556a25bc0) at /home/pace/dev/arrow/cpp/src/arrow/csv/writer.cc:131
   #17 0x00007fff714498bb in arrow::csv::(anonymous namespace)::CSVWriterImpl::TranslateMinimalBatch (this=0x555556642100, batch=...) at /home/pace/dev/arrow/cpp/src/arrow/csv/writer.cc:561
   #18 0x00007fff7144888b in arrow::csv::(anonymous namespace)::CSVWriterImpl::WriteTable (this=0x555556642100, table=..., max_chunksize=-1) at /home/pace/dev/arrow/cpp/src/arrow/csv/writer.cc:485
   #19 0x00007fff726d513a in arrow::ipc::RecordBatchWriter::WriteTable (this=0x555556642100, table=...) at /home/pace/dev/arrow/cpp/src/arrow/ipc/writer.cc:1030
   #20 0x00007fff7144a028 in arrow::csv::WriteCSV (table=..., options=..., output=0x555556141990) at /home/pace/dev/arrow/cpp/src/arrow/csv/writer.cc:609
   #21 0x00007fff6e61b5ae in __pyx_pw_7pyarrow_4_csv_7write_csv(_object*, _object* const*, long, _object*) () from /home/pace/dev/arrow/python/pyarrow/_csv.cpython-311-x86_64-linux-gnu.so
   #22 0x000055555574bec1 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7fff6e759e50, tstate=0x555555ae0d98 <_PyRuntime+166328>)
       at /usr/local/src/conda/python-3.11.0/Include/internal/pycore_call.h:92
   #23 PyObject_Vectorcall (callable=0x7fff6e759e50, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.0/Objects/call.c:299
   #24 0x000055555573ec6c in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at /usr/local/src/conda/python-3.11.0/Python/ceval.c:4772
   #25 0x000055555573a1fb in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7fb0020, tstate=0x555555ae0d98 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.0/Include/internal/pycore_ceval.h:73
   #26 _PyEval_Vector (tstate=0x555555ae0d98 <_PyRuntime+166328>, func=0x7ffff6dd1f80, locals=<optimized out>, args=0x0, argcount=<optimized out>, kwnames=<optimized out>)
       at /usr/local/src/conda/python-3.11.0/Python/ceval.c:6428
   #27 0x00005555558049af in PyEval_EvalCode (co=<optimized out>, globals=0x7ffff6df2d40, locals=<optimized out>) at /usr/local/src/conda/python-3.11.0/Python/ceval.c:1154
   #28 0x0000555555827479 in run_eval_code_obj (tstate=0x555555ae0d98 <_PyRuntime+166328>, co=0x555555bb2900, globals=0x7ffff6df2d40, locals=0x7ffff6df2d40)
       at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:1714
   #29 0x0000555555823724 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7ffff6df2d40, locals=0x7ffff6df2d40, flags=<optimized out>, arena=<optimized out>)
       at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:1735
   #30 0x00005555558387d2 in pyrun_file (fp=fp@entry=0x555555b48c10, filename=filename@entry=0x7ffff6d53270, start=start@entry=257, globals=globals@entry=0x7ffff6df2d40, locals=locals@entry=0x7ffff6df2d40, 
       closeit=closeit@entry=1, flags=0x7fffffffd888) at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:1630
   #31 0x000055555583812f in _PyRun_SimpleFileObject (fp=0x555555b48c10, filename=0x7ffff6d53270, closeit=1, flags=0x7fffffffd888) at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:440
   #32 0x0000555555837f03 in _PyRun_AnyFileObject (fp=0x555555b48c10, filename=0x7ffff6d53270, closeit=1, flags=0x7fffffffd888) at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:79
   #33 0x000055555583206c in pymain_run_file_obj (skip_source_first_line=0, filename=0x7ffff6d53270, program_name=0x7ffff6c1f690) at /usr/local/src/conda/python-3.11.0/Modules/main.c:360
   #34 pymain_run_file (config=0x555555ac6de0 <_PyRuntime+59904>) at /usr/local/src/conda/python-3.11.0/Modules/main.c:379
   #35 pymain_run_python (exitcode=0x7fffffffd880) at /usr/local/src/conda/python-3.11.0/Modules/main.c:601
   #36 Py_RunMain () at /usr/local/src/conda/python-3.11.0/Modules/main.c:680
   #37 0x00005555557f33f9 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.11.0/Modules/main.c:734
   #38 0x00007ffff7c29d90 in __libc_start_call_main (main=main@entry=0x5555557f3350 <main>, argc=argc@entry=2, argv=argv@entry=0x7fffffffdad8) at ../sysdeps/nptl/libc_start_call_main.h:58
   #39 0x00007ffff7c29e40 in __libc_start_main_impl (main=0x5555557f3350 <main>, argc=2, argv=0x7fffffffdad8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdac8)
       at ../csu/libc-start.c:392
   #40 0x00005555557f32a1 in _start ()
   ```
   

