Posted to jira@arrow.apache.org by "Ramakrishna Prabhu (Jira)" <ji...@apache.org> on 2020/11/18 00:19:00 UTC

[jira] [Created] (ARROW-10635) ORC reader issue with bool column.

Ramakrishna Prabhu created ARROW-10635:
------------------------------------------

             Summary: ORC reader issue with bool column.
                 Key: ARROW-10635
                 URL: https://issues.apache.org/jira/browse/ARROW-10635
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 1.0.1
            Reporter: Ramakrishna Prabhu
         Attachments: bool_pq.parquet, broken_bool.zip

The ORC file contains a single column of boolean type. Starting at row `20000`, the values read by pyarrow do not match the expected values.

 

From what I have observed, the writer that produced this ORC file assumes that the boolean RLE stream is aligned with row index boundaries. That is, no two row groups share a byte, and a row group never starts at a bit offset within a byte. I think pyarrow instead treats the leftover bits of the partial byte at the end of a row group as data, which shifts the subsequent values.
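
The suspected mechanism can be illustrated with a small, self-contained sketch. This is not pyarrow's reader or the ORC writer; all function names here are hypothetical, the row group size of 6 is chosen only so that each group leaves two padding bits, and the byte run-length encoding that ORC applies on top of the packed bits is omitted. It only shows how a reader that treats padding bits as data ends up shifted relative to a reader that skips them.

```python
def pack_bits(values):
    """Pack booleans MSB-first into bytes, zero-padding the last byte."""
    out = bytearray()
    for i in range(0, len(values), 8):
        byte = 0
        for j, v in enumerate(values[i:i + 8]):
            if v:
                byte |= 1 << (7 - j)
        out.append(byte)
    return bytes(out)


def unpack_bits(data):
    """Expand bytes into a flat list of booleans, MSB first."""
    return [bool((b >> (7 - j)) & 1) for b in data for j in range(8)]


def write_aligned(values, group_size):
    """Hypothetical writer: every row group starts on a fresh byte."""
    return b"".join(
        pack_bits(values[i:i + group_size])
        for i in range(0, len(values), group_size)
    )


def read_aligned(data, n_rows, group_size):
    """Reader that discards the padding bits at the end of each row group."""
    rows, pos, remaining = [], 0, n_rows
    while remaining > 0:
        n = min(group_size, remaining)
        n_bytes = (n + 7) // 8
        rows.extend(unpack_bits(data[pos:pos + n_bytes])[:n])
        pos += n_bytes
        remaining -= n
    return rows


def read_continuous(data, n_rows):
    """Reader that treats the stream as one continuous run of bits."""
    return unpack_bits(data)[:n_rows]


values = [True, False, True, True, False, True,
          True, True, False, False, True, False]
group_size = 6  # assumption: leaves 2 padding bits per row group

data = write_aligned(values, group_size)
aligned = read_aligned(data, len(values), group_size)
continuous = read_continuous(data, len(values))

# Row indices where the continuous reader disagrees with the original data
mismatches = [i for i, (a, b) in enumerate(zip(continuous, values)) if a != b]
print(mismatches)
```

In this sketch the first row group is read correctly by both readers, and the continuous reader starts to disagree exactly at the first row of the second group, where the two padding bits are consumed as if they were rows.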

 

I have attached a Parquet file with the same data for reference. Comparing the two, you will notice that the last two bits of the partial byte are treated as data, which shifts the values by two rows.

 
{code:python}
import pandas as pd
from pyarrow import orc

# Read the ORC file and the reference Parquet file
f = orc.ORCFile('broken_bool.orc')
pdf_orc = f.read().to_pandas()
pdf_pq = pd.read_parquet("bool_pq.parquet")

# Rows where the ORC values differ from the Parquet reference
pdf_orc.col_bool.dropna()[pdf_orc.col_bool.dropna() != pdf_pq.col_bool.dropna()]
{code}


20002 False
20004 False
20005 True
20007 False
20014 True
 ... 
21973 False
21974 False
21985 True
21988 True
21993 False

 


