Posted to user@arrow.apache.org by Kyle Barron <ky...@gmail.com> on 2022/02/28 03:28:27 UTC

[Rust] Unable to read in Python or JS Arrow Stream IPC files written in Rust

Hello!

I've used Arrow a decent bit in Python and JS but I'm pretty new to Rust.
I'm trying to write a minimal binding of Rust's Parquet to WebAssembly in
order to decode Parquet files to Arrow on the web. I have code that works
<https://github.com/kylebarron/parquet-wasm/blob/main/src/lib.rs> but only
some of the time. For example this test data
<https://github.com/kylebarron/parquet-wasm/blob/9495a87e00ae7073966d171bdcbfa1b87c63991b/data/works.parquet>
 (created here
<https://github.com/kylebarron/parquet-wasm/blob/9495a87e00ae7073966d171bdcbfa1b87c63991b/data/generate_data.py#L40-L43>)
seems to work with the js arrow.RecordBatchReader
<https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/www/index.js#L50-L52>
 but other test data
<https://github.com/kylebarron/parquet-wasm/blob/9495a87e00ae7073966d171bdcbfa1b87c63991b/data/not_work.parquet>
 (created here
<https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/data/generate_data.py#L45-L48>)
fails with "Error: Expected to read 1249648 metadata bytes, but only read
300.".
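
For context on that error message: in the Arrow IPC streaming format, each
message is framed by a 4-byte continuation marker (0xFFFFFFFF) followed by a
4-byte little-endian metadata length. A reader that decodes a bogus length
from damaged bytes would ask for more metadata than the buffer holds, which
would be consistent with the message above. Below is a minimal std-only
sketch of that framing check. `read_metadata_len` is a hypothetical helper
and the buffer is fabricated; neither comes from an Arrow library.

```rust
/// Continuation marker that prefixes each message in the Arrow IPC stream format.
const CONTINUATION: [u8; 4] = [0xFF, 0xFF, 0xFF, 0xFF];

/// Hypothetical helper: decode the metadata length of the first message.
/// Returns the declared length and how many bytes remain after the 8-byte prefix.
fn read_metadata_len(buf: &[u8]) -> Option<(i32, usize)> {
    // Need the 4-byte marker plus the 4-byte length.
    if buf.len() < 8 || buf[..4] != CONTINUATION {
        return None;
    }
    let len = i32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]);
    Some((len, buf.len() - 8))
}

fn main() {
    // Fabricated buffer: a prefix declaring 1249648 bytes, followed by only 300.
    let mut buf: Vec<u8> = Vec::new();
    buf.extend_from_slice(&CONTINUATION);
    buf.extend_from_slice(&1_249_648i32.to_le_bytes());
    buf.extend_from_slice(&[0u8; 300]);

    match read_metadata_len(&buf) {
        Some((declared, available)) if (declared as i64) > (available as i64) => {
            println!("declared {} metadata bytes, only {} available", declared, available)
        }
        Some((declared, _)) => println!("declared {} metadata bytes", declared),
        None => println!("no continuation marker at offset 0"),
    }
}
```

Running a check like this on the bytes received in JS and on the bytes
produced in Rust should show whether the two sides agree on the first length
prefix.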

Based on logging, it *seems* as if parsing the Parquet file goes smoothly.
It's only writing the Arrow IPC format that fails (on the JS side when
trying to verify it). I'm currently trying to create the StreamWriter
<https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L122-L123>,
then write all the Arrow RecordBatches into the writer
<https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L127-L128>,
then finish the writer
<https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L142>,
and send the output back to JS
<https://github.com/kylebarron/parquet-wasm/blob/79580c64c698570fd1a8a48b55698ca0be630aa8/src/lib.rs#L145-L156>
.

Has anyone seen a similar problem before, or does anyone have suggestions on
where to debug further? Alternatively, if an end-to-end example exists of
reading from a Parquet file and returning an Arrow buffer, it would be very
helpful to see.

Best,
Kyle Barron

Re: [Rust] Unable to read in Python or JS Arrow Stream IPC files written in Rust

Posted by Andrew Lamb <al...@influxdata.com>.
I am glad you got it working!

Re: [Rust] Unable to read in Python or JS Arrow Stream IPC files written in Rust

Posted by Kyle Barron <ky...@gmail.com>.
Thanks to both!

I did more debugging last night and I believe the entire issue was that
`unsafe { Uint8Array::view(&file) }`
<https://github.com/kylebarron/parquet-wasm/blob/9aee64343b76c1c6b7550f9d27aede327f9a1b75/src/lib.rs#L140>
was unsafe 😄. I originally copied that from Dominik Moritz's
`arrow-wasm`'s `Table.serialize`
<https://github.com/domoritz/arrow-wasm/blob/3d6d4c6ab940fd317c4a19610cd204a06dc29584/src/table.rs#L58-L76>,
and just assumed it was OK usage. But when I instead create a new
`js_sys::Uint8Array` and then fill that array with the writer's contents,
the bytes in JS match the bytes in Rust, and `arrow.tableFromIPC` in JS
works well.

Possibly related to #1335: I was originally surprised that these
IPC file-format files (written with the unsafe view) were readable in
Python but not in JS. From looking at the hexdump, I think the unsafe view
corrupted the beginning of the file but not the end. So
`pyarrow.ipc.open_file` was likely able to open the file because it first
looks at the footer, while Arrow JS likely tries to parse stream and file
IPC data in the same way.
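
That reading is consistent with how the Arrow IPC file format is laid out:
the file begins with the magic bytes `ARROW1` (plus padding) and ends with
the footer, a 4-byte footer length, and `ARROW1` again. A reader that starts
from the footer can tolerate a damaged head, while a reader that sniffs the
first bytes cannot. Below is a minimal std-only sketch of the two checks. The
function names and the fabricated buffer are assumptions for illustration and
do not mirror the internals of pyarrow or Arrow JS.

```rust
/// Magic bytes that open and close an Arrow IPC file (the "file" format, not the stream).
const MAGIC: &[u8; 6] = b"ARROW1";

/// True if the buffer starts with the file-format magic.
fn head_looks_like_file(buf: &[u8]) -> bool {
    buf.starts_with(MAGIC)
}

/// True if the buffer ends with the file-format magic, as a footer-first reader would check.
fn tail_looks_like_file(buf: &[u8]) -> bool {
    buf.ends_with(MAGIC)
}

fn main() {
    // Fabricated stand-in for a file: magic + 2 padding bytes, a body, then the closing magic.
    let mut file: Vec<u8> = Vec::new();
    file.extend_from_slice(MAGIC);
    file.extend_from_slice(&[0, 0]);
    file.extend_from_slice(&[42u8; 16]);
    file.extend_from_slice(MAGIC);

    // Simulate corruption limited to the beginning of the buffer.
    let mut damaged = file.clone();
    damaged[..8].fill(0xAA);

    println!(
        "intact:  head={} tail={}",
        head_looks_like_file(&file),
        tail_looks_like_file(&file)
    );
    println!(
        "damaged: head={} tail={}",
        head_looks_like_file(&damaged),
        tail_looks_like_file(&damaged)
    );
}
```

A buffer that passes the tail check but fails the head check would match the
"readable in Python, not in JS" symptom described above.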

Kyle

Re: [Rust] Unable to read in Python or JS Arrow Stream IPC files written in Rust

Posted by Andrew Lamb <al...@influxdata.com>.
Sorry Kyle, I totally missed this email.

Initially I would say the symptoms sound like "not calling finish() on the
writer", but I skimmed some of your linked code and saw at least one call to
finish, so maybe this is not the root cause.
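
For reference, the IPC stream format lets a writer signal end-of-stream with
8 bytes: the continuation marker 0xFFFFFFFF followed by a 32-bit zero length.
Assuming the writer emits that marker on `finish()` (an assumption about the
writer's default behavior, not something confirmed in this thread), a quick
std-only check on the output buffer can help rule this cause in or out.
`ends_with_eos` is a hypothetical helper and the bytes are fabricated.

```rust
/// End-of-stream marker in the Arrow IPC stream format:
/// continuation bytes 0xFFFFFFFF followed by a 32-bit zero length.
const EOS: [u8; 8] = [0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00];

/// Hypothetical helper: true if the buffer ends with the end-of-stream marker.
fn ends_with_eos(buf: &[u8]) -> bool {
    buf.ends_with(&EOS)
}

fn main() {
    // Fabricated bytes standing in for a written stream body.
    let body = [1u8, 2, 3, 4];

    // The same body with the end-of-stream marker appended.
    let mut finished = body.to_vec();
    finished.extend_from_slice(&EOS);

    println!("unfinished: {}", ends_with_eos(&body));
    println!("finished:   {}", ends_with_eos(&finished));
}
```

A missing marker is only a hint: readers may also treat the end of the input
as the end of the stream.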

In terms of reading from a Parquet file and returning Arrow, I would
recommend checking out the arrow module in the parquet crate [2]. The linked
documentation also includes an example.

There is one existing issue[1] that sounds like it may be similar.

Hope that helps,
Andrew

[1] https://github.com/apache/arrow-rs/issues/1335
[2] https://docs.rs/parquet/10.0.0/parquet/arrow/index.html

Re: [Rust] Unable to read in Python or JS Arrow Stream IPC files written in Rust

Posted by Micah Kornfield <em...@gmail.com>.
Hi Kyle,
I'm not sure if Rust contributors monitor this list; you might have better
luck opening an issue on the Rust repo [1].

[1] https://github.com/apache/arrow-rs
