Posted to issues@arrow.apache.org by "Micah Kornfield (Jira)" <ji...@apache.org> on 2022/04/08 22:20:00 UTC

[jira] [Created] (ARROW-16160) [C++] IPC Stream Reader doesn't check if extra fields are present for RecordBatches

Micah Kornfield created ARROW-16160:
---------------------------------------

             Summary: [C++] IPC Stream Reader doesn't check if extra fields are present for RecordBatches
                 Key: ARROW-16160
                 URL: https://issues.apache.org/jira/browse/ARROW-16160
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Python
    Affects Versions: 6.0.1
            Reporter: Micah Kornfield


I looked through recent commits and I don't think this issue has been patched yet; the following reproduces it:

{code:title=test.python|borderStyle=solid}
import pyarrow as pa

# Example batches (inferred from the output below): rb1 has a single
# int64 column c1; rb2 shares c1 but carries an extra column.
rb1 = pa.record_batch([pa.array([1])], names=["c1"])
rb2 = pa.record_batch([pa.array([1]), pa.array([2])], names=["c1", "c2"])

with pa.output_stream("/tmp/f1") as sink:
  with pa.RecordBatchStreamWriter(sink, rb1.schema) as writer:
    writer.write(rb1)
    end_rb1 = sink.tell()

with pa.output_stream("/tmp/f2") as sink:
  with pa.RecordBatchStreamWriter(sink, rb2.schema) as writer:
    writer.write(rb2)
    start_rb2_only = sink.tell()
    writer.write(rb2)
    end_rb2 = sink.tell()

# Stitch together rb1's schema and batch, then rb2's batch without its schema.
with pa.output_stream("/tmp/f3") as sink:
  with pa.input_stream("/tmp/f1") as inp:
     sink.write(inp.read(end_rb1))
  with pa.input_stream("/tmp/f2") as inp:
    inp.seek(start_rb2_only)
    sink.write(inp.read(end_rb2 - start_rb2_only))

with pa.ipc.open_stream("/tmp/f3") as reader:
  print(reader.read_all())
{code}
Yields:
{code}
pyarrow.Table
c1: int64
----
c1: [[1],[1]]
{code}

I would expect this to error because the second stitched-in record batch has more fields than the stream's schema, but it appears to load just fine, silently dropping the extra field.

Is this intended behavior?
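For comparison, a minimal sketch (not Arrow's actual implementation) of the kind of strict check one might expect the reader to enforce; in the IPC stream itself the batch message carries field-node counts rather than a schema, but the mismatch is equally detectable there. The batches below are hypothetical, chosen to mirror the repro:

{code:title=strict_check.python|borderStyle=solid}
import pyarrow as pa

# Hypothetical batches mirroring the repro: rb2 has one extra field.
rb1 = pa.record_batch([pa.array([1])], names=["c1"])
rb2 = pa.record_batch([pa.array([1]), pa.array([2])], names=["c1", "c2"])

def check_batch(stream_schema, batch):
    # Reject any batch whose schema does not match the stream schema,
    # instead of silently truncating to the stream's fields.
    if not batch.schema.equals(stream_schema):
        raise ValueError(
            f"batch schema {batch.schema.names} does not match "
            f"stream schema {stream_schema.names}")

check_batch(rb1.schema, rb1)  # passes
try:
    check_batch(rb1.schema, rb2)
except ValueError as e:
    print("rejected:", e)
{code}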



--
This message was sent by Atlassian Jira
(v8.20.1#820001)