Posted to issues@arrow.apache.org by "Brian Hulette (JIRA)" <ji...@apache.org> on 2018/12/04 03:30:00 UTC

[jira] [Commented] (ARROW-3667) [JS] Incorrectly reads record batches with an all null column

    [ https://issues.apache.org/jira/browse/ARROW-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708149#comment-16708149 ] 

Brian Hulette commented on ARROW-3667:
--------------------------------------

Makes sense, thanks for the context.
Maybe I'll start a discussion on the mailing list to define how we represent the null datatype in JSON.

> [JS] Incorrectly reads record batches with an all null column
> -------------------------------------------------------------
>
>                 Key: ARROW-3667
>                 URL: https://issues.apache.org/jira/browse/ARROW-3667
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: JS-0.3.1
>            Reporter: Brian Hulette
>            Priority: Major
>             Fix For: JS-0.4.0
>
>
> The JS library seems to incorrectly read any columns that come after an all-null column in IPC buffers produced by pyarrow.
> Here's a Python script that generates two Arrow buffers: one with an all-null column followed by a utf-8 column, and a second with those two columns reversed:
> {code:python}
> import pyarrow as pa
> import pandas as pd
>
> def serialize_to_arrow(df, fd):
>     # Write the DataFrame as a single record batch in the Arrow file format.
>     batch = pa.RecordBatch.from_pandas(df)
>     writer = pa.RecordBatchFileWriter(fd, batch.schema)
>     writer.write_batch(batch)
>     writer.close()
>
> if __name__ == "__main__":
>     # All-null column first: the JS reader misreads the column after it.
>     df = pd.DataFrame(data={'nulls': [None, None, None], 'not nulls': ['abc', 'def', 'ghi']}, columns=['nulls', 'not nulls'])
>     with open('bad.arrow', 'wb') as fd:
>         serialize_to_arrow(df, fd)
>     # Same data with the column order reversed: the JS reader handles it.
>     df = pd.DataFrame(df, columns=['not nulls', 'nulls'])
>     with open('good.arrow', 'wb') as fd:
>         serialize_to_arrow(df, fd)
> {code}
> JS incorrectly interprets the [null, not null] case:
> {code:javascript}
> > var arrow = require('apache-arrow')
> undefined
> > var fs = require('fs')
> undefined
> > arrow.Table.from(fs.readFileSync('good.arrow')).getColumn('not nulls').get(0)
> 'abc'
> > arrow.Table.from(fs.readFileSync('bad.arrow')).getColumn('not nulls').get(0)
> '\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000'
> {code}
> Presumably this is because pyarrow is omitting some (or all) of the buffers associated with the all-null column, but the JS IPC reader is still looking for them, causing the buffer count to get out of sync.
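
The suspected mechanism can be illustrated with a toy model. This is only a sketch under stated assumptions: the buffer counts, the `WRITER_BUFFERS` and `READER_BUFFERS` tables, and the `write`/`read` helpers are all hypothetical and do not reflect the real Arrow IPC layout or either library's code. It only shows how a reader that assigns buffers by position drifts out of sync when it expects a different number of buffers for a column than the writer emitted, and why the column order matters.

```python
# Toy model only: the counts below are illustrative assumptions, not the
# actual number of buffers pyarrow writes or the JS reader expects.
WRITER_BUFFERS = {"null": 0, "utf8": 3}  # hypothetical writer omits null-column buffers
READER_BUFFERS = {"null": 1, "utf8": 3}  # hypothetical reader still expects one


def write(columns):
    """Flatten the columns into one list of labelled buffers."""
    flat = []
    for name, kind in columns:
        for i in range(WRITER_BUFFERS[kind]):
            flat.append(f"{name}[{i}]")
    return flat


def read(columns, flat):
    """Hand out buffers to columns purely by position and expected count."""
    out, pos = {}, 0
    for name, kind in columns:
        n = READER_BUFFERS[kind]
        out[name] = flat[pos:pos + n]
        pos += n
    return out


bad = [("nulls", "null"), ("not nulls", "utf8")]
good = [("not nulls", "utf8"), ("nulls", "null")]

bad_result = read(bad, write(bad))
good_result = read(good, write(good))

print("bad: ", bad_result)
print("good:", good_result)
```

In this model the "good" ordering works only because the mismatch happens after the utf-8 column has already been read; in the "bad" ordering the null column consumes a buffer that belongs to the utf-8 column, so every later buffer is shifted.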


