Posted to issues@orc.apache.org by "Bobby Wang (Jira)" <ji...@apache.org> on 2022/01/06 08:54:00 UTC

[jira] (ORC-1075) Failed to read rows from the ORC file without statistics in RowIndex when filter is pushed down for 1.6.11

    [ https://issues.apache.org/jira/browse/ORC-1075 ]


    Bobby Wang deleted comment on ORC-1075:
    ---------------------------------

was (Author: wbo4958):
Thanks, I just tested your patch. It works.

> Failed to read rows from the ORC file without statistics in RowIndex when filter is pushed down for 1.6.11
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: ORC-1075
>                 URL: https://issues.apache.org/jira/browse/ORC-1075
>             Project: ORC
>          Issue Type: Bug
>          Components: Java, Reader
>    Affects Versions: 1.6.11
>            Reporter: Bobby Wang
>            Priority: Blocker
>         Attachments: none-1.orc
>
>
> I have attached an ORC file whose RowIndex entries do not seem to include ColumnStatistics.
> {color:#FF0000}Judging from the ORC spec, ColumnStatistics does not appear to be a required field of a RowIndex entry. Is that correct?{color}
>  
> {code:java}
> message RowIndexEntry {
>   repeated uint64 positions = 1 [packed=true];
>   optional ColumnStatistics statistics = 2;
> }
> message RowIndex {
>   repeated RowIndexEntry entry = 1;
> }
> {code}
> The metadata of the ORC file:
>  
> {code:java}
> $ orctools meta none.orc 
> log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
> Processing data file none.orc [length: 124]
> Structure for none.orc
> File Version: 0.12 with ORIGINAL
> Rows: 3
> Compression: NONE
> Calendar: Julian/Gregorian
> Type: struct<INT:int>
> Stripe Statistics:
>   Stripe 1:
>     Column 0: count: 3 hasNull: true
>     Column 1: count: 3 hasNull: true min: 1 max: 3 sum: 6
> File Statistics:
> Stripes:
>   Stripe: offset: 3 data: 4 rows: 3 tail: 32 index: 10
>     Stream: column 0 section ROW_INDEX start: 3 length 4
>     Stream: column 1 section ROW_INDEX start: 7 length 6
>     Stream: column 1 section DATA start: 13 length 4
>     Encoding column 0: DIRECT
>     Encoding column 1: DIRECT_V2
> File length: 124 bytes
> Padding length: 0 bytes
> Padding ratio: 0%
> {code}
>  
> The data in the ORC file:
> {code:java}
> $ orctools data none.orc 
> log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
> Processing data file none.orc [length: 124]
> {"INT":1}
> {"INT":2}
> {"INT":3}{code}
> The following code tries to read every row of the ORC file with a filter pushed down:
> {code:java}
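> import java.io.IOException;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
> import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
> import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
> import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
> import org.apache.orc.OrcFile;
> import org.apache.orc.Reader;
> import org.apache.orc.RecordReader;
> import org.apache.orc.TypeDescription;
>
> // The wrapper class name is arbitrary; it only makes the snippet compilable.
> public class ReadNoneOrc {
>   public static void main(String[] args) throws IOException {
>     // Pick the schema we want to read using schema evolution
>     TypeDescription readSchema = TypeDescription.fromString("struct<INT:int>");
>     // Get the information from the file footer
>     Reader reader = OrcFile.createReader(new Path("none.orc"),
>         OrcFile.readerOptions(new Configuration()));
>     System.out.println("File schema: " + reader.getSchema());
>     System.out.println("Row count: " + reader.getNumberOfRows());
>     // Predicate pushdown: ask the reader to skip row groups that cannot contain INT = 2
>     RecordReader rowIterator = reader.rows(
>         reader.options()
>             .schema(readSchema)
>             .searchArgument(
>                 SearchArgumentFactory.newBuilder()
>                     .equals("INT", PredicateLeaf.Type.LONG, 2L)
>                     .build(),
>                 new String[]{"INT"}));
>     // Read the row data
>     VectorizedRowBatch batch = readSchema.createRowBatch();
>     LongColumnVector x = (LongColumnVector) batch.cols[0];
>     while (rowIterator.nextBatch(batch)) {
>       System.out.println(batch.size);
>       for (int row = 0; row < batch.size; ++row) {
>         // A repeating vector stores its single value at index 0
>         int xRow = x.isRepeating ? 0 : row;
>         System.out.println("INT: " + (x.noNulls || !x.isNull[xRow] ? x.vector[xRow] : null));
>       }
>     }
>     rowIterator.close();
>   }
> }
> {code}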
>  
> h2. Output from 1.6.11
> File schema: struct<INT:int>
> Row count: 3
> h2. Output from 1.5.10
> File schema: struct<INT:int>
> Row count: 3
> 3
> INT: 1
> INT: 2
> INT: 3
>  
> I originally hit this issue on Spark 3.2, which depends on ORC 1.6.11. It does not occur on Spark 3.0.x, which depends on ORC 1.5.10.
>  
>  
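For illustration, the behaviour the report expects can be sketched as follows: when a row-index entry carries no statistics, there is nothing to evaluate the predicate against, so the row group has to be read rather than skipped. This is a minimal, self-contained sketch and not ORC's reader code. The names RowGroupPruningSketch, IntStats, mayContain and selectRowGroups are hypothetical, and java.util.Optional merely stands in for the optional protobuf field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

public class RowGroupPruningSketch {

    /** Hypothetical stand-in for the min/max part of integer column statistics. */
    static final class IntStats {
        final long min;
        final long max;

        IntStats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Decides whether a row group may contain rows where the column equals the literal.
     * Absent statistics give no evidence either way, so the row group must be kept.
     */
    static boolean mayContain(Optional<IntStats> stats, long literal) {
        if (!stats.isPresent()) {
            return true; // nothing known about the values: never skip
        }
        IntStats s = stats.get();
        return s.min <= literal && literal <= s.max;
    }

    /** Returns the indexes of the row groups that have to be read for "column = literal". */
    static List<Integer> selectRowGroups(List<Optional<IntStats>> index, long literal) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < index.size(); i++) {
            if (mayContain(index.get(i), literal)) {
                selected.add(i);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // A single row group whose index entry has no statistics.
        Optional<IntStats> none = Optional.empty();
        List<Optional<IntStats>> withoutStats = Collections.singletonList(none);
        System.out.println(selectRowGroups(withoutStats, 2L)); // prints [0]

        // With statistics present, a row group whose range excludes the literal is pruned.
        List<Optional<IntStats>> withStats = Arrays.asList(
            Optional.of(new IntStats(1, 3)),
            Optional.of(new IntStats(10, 20)));
        System.out.println(selectRowGroups(withStats, 2L)); // prints [0]
    }
}
```

With a single statistics-free row group, the sketch selects that row group. This matches the 1.5.10 output above, where all three rows come back even though the predicate is INT = 2, whereas 1.6.11 returns no batches at all.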



--
This message was sent by Atlassian Jira
(v8.20.1#820001)